<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kubernetes</title>
    <description>The latest articles tagged 'kubernetes' on DEV Community.</description>
    <link>https://dev.to/t/kubernetes</link>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tag/kubernetes"/>
    <language>en</language>
    <item>
      <title>CloudNativePG: Running PostgreSQL in Kubernetes Without the Pain</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Tue, 16 Jun 2026 00:15:32 +0000</pubDate>
      <link>https://dev.to/futhgar/cloudnativepg-running-postgresql-in-kubernetes-without-the-pain-32pj</link>
      <guid>https://dev.to/futhgar/cloudnativepg-running-postgresql-in-kubernetes-without-the-pain-32pj</guid>
      <description>&lt;p&gt;A CloudNativePG cluster that sits in &lt;code&gt;Setting up primary&lt;/code&gt; forever, with zero error events on the Cluster resource and a perfectly healthy operator, is one of the more frustrating ways to spend an afternoon. The operator says it's working. The pods never appear. And the actual cause has nothing to do with the database at all.&lt;/p&gt;

&lt;p&gt;Running stateful databases on Kubernetes used to be the thing everyone told you not to do. CloudNativePG (CNPG) changed that calculus for a lot of people, including me. It's a proper operator: it handles failover, backups, connection routing, and rolling upgrades through native Kubernetes primitives instead of bolting Postgres onto a StatefulSet and praying. If you run a hardened cluster with admission controllers, network policies, and least-privilege RBAC, this post is about the friction you'll hit that the quickstart never mentions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should care
&lt;/h2&gt;

&lt;p&gt;If your cluster is vanilla, &lt;code&gt;kubectl apply&lt;/code&gt; the operator and a &lt;code&gt;Cluster&lt;/code&gt; manifest, and you're done in ten minutes. The CNPG docs are genuinely good for that path. This is for the rest of us: people running Kyverno or OPA Gatekeeper, self-signed cert chains, and the kind of policy-as-code setup where every workload has to justify its existence. That's where CNPG stops being a ten-minute install and starts being an integration project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tried first
&lt;/h2&gt;

&lt;p&gt;The first instinct, when a CNPG cluster hangs, is to assume you got the database config wrong. So you go read your &lt;code&gt;Cluster&lt;/code&gt; manifest line by line. You check the storage class. You check that the PVC bound. You bump the operator log level and watch it cheerfully report that it's reconciling, over and over, with no complaints.&lt;/p&gt;

&lt;p&gt;Here's the trap: the CNPG operator doesn't run &lt;code&gt;initdb&lt;/code&gt; itself. It creates a Kubernetes &lt;strong&gt;Job&lt;/strong&gt; to bootstrap the primary. That Job spawns a Pod. And in a hardened cluster, the Pod is where everything dies, because your admission controller is judging it against policies the operator's own Pods were exempted from but the bootstrap Job was not.&lt;/p&gt;

&lt;p&gt;The mistake I see constantly is reading the wrong resource. People run &lt;code&gt;kubectl describe cluster&lt;/code&gt; and &lt;code&gt;kubectl describe pod&lt;/code&gt; on the operator, find nothing, and conclude CNPG is broken. The events you need are on the &lt;strong&gt;Job&lt;/strong&gt; and, if one was created at all, on the Job's Pod. When admission rejects the Pod, the Job controller records the failure as a &lt;code&gt;FailedCreate&lt;/code&gt; event on the Job itself, not on the Cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The Cluster looks stuck here, but says nothing useful&lt;/span&gt;
kubectl get cluster &lt;span class="nt"&gt;-n&lt;/span&gt; databases
&lt;span class="c"&gt;# NAME       AGE   INSTANCES   READY   STATUS                    PRIMARY&lt;/span&gt;
&lt;span class="c"&gt;# pg-main    8m    3           0       Setting up primary&lt;/span&gt;

&lt;span class="c"&gt;# The real story is on the bootstrap Job's events&lt;/span&gt;
kubectl describe job &lt;span class="nt"&gt;-n&lt;/span&gt; databases pg-main-1-initdb


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a policy is the culprit, that describe output is where you'll finally see something like &lt;code&gt;admission webhook "validate.kyverno.svc" denied the request: validation error: every container must define resource limits&lt;/code&gt;. The bootstrap Job's Pod template didn't set CPU/memory limits, your &lt;code&gt;require-resource-limits&lt;/code&gt; policy rejected it, and the operator quietly retries forever because, from its perspective, it asked Kubernetes nicely and Kubernetes said no.&lt;/p&gt;

&lt;p&gt;I spent longer than I'd like to admit assuming the storage layer was at fault before I went and looked at the Job. The lesson stuck: when an operator hangs, find the resource the operator &lt;em&gt;creates&lt;/em&gt;, not the resource it &lt;em&gt;manages&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Exempt CNPG lifecycle resources from blocking policies
&lt;/h3&gt;

&lt;p&gt;CNPG generates Jobs and Pods on your behalf, and you can't directly edit their pod templates the way you would a Deployment you wrote. So the fix isn't to add resource limits to the Job. It's to teach your policy engine that CNPG-owned resources are allowed to skip the rule that's blocking them.&lt;/p&gt;

&lt;p&gt;Every resource CNPG creates carries the &lt;code&gt;cnpg.io/cluster&lt;/code&gt; label. That label is your exclusion key. For Kyverno, add an &lt;code&gt;exclude&lt;/code&gt; block to the rule that's firing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-resource-limits&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;validate-resources&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="c1"&gt;# CNPG-managed Pods (instances + bootstrap Jobs) carry this label&lt;/span&gt;
              &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;cnpg.io/cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;container&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;define&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CPU&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limits."&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
                    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a deliberately narrow exclusion. You're not disabling the policy. You're carving out resources that carry one specific operator-owned label, so an arbitrary label won't get a limitless Pod past the gate. The label itself is not a security boundary, though: anyone who can create Pods can also set &lt;code&gt;cnpg.io/cluster&lt;/code&gt;. To close that gap, scope the exclusion to the &lt;code&gt;databases&lt;/code&gt; namespace as well so the label only grants an exemption where CNPG is actually allowed to run.&lt;/p&gt;
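
&lt;p&gt;As a rough sketch of the namespace-scoped variant, the &lt;code&gt;exclude&lt;/code&gt; block below combines the label selector with a namespace filter. Conditions inside a single &lt;code&gt;resources&lt;/code&gt; entry are ANDed, so a Pod has to satisfy both to be exempt. The namespace name &lt;code&gt;databases&lt;/code&gt; comes from the examples in this post; adjust it to wherever your CNPG clusters actually live.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Fragment of the validate-resources rule: replaces its exclude block
      exclude:
        any:
          - resources:
              # Both conditions must hold: right namespace AND CNPG label
              namespaces: ["databases"]
              selector:
                matchLabels:
                  cnpg.io/cluster: "*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;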

&lt;p&gt;The same idea applies to OPA Gatekeeper, just expressed differently: either exempt the whole namespace with the constraint's &lt;code&gt;match.excludedNamespaces&lt;/code&gt;, or write a &lt;code&gt;match.labelSelector&lt;/code&gt; that only matches Pods without the CNPG label. The principle doesn't change. Match the operator's label, exempt the lifecycle resources, leave everything else under enforcement. I wrote about the general shape of this in &lt;a href="https://dev.to/posts/kyverno-admission-controllers-policy-as-code-that-actually-works/"&gt;Kyverno Admission Controllers: Policy-as-Code That Actually Works&lt;/a&gt;, and CNPG's &lt;code&gt;initdb&lt;/code&gt; Job is the cleanest real-world example I've found of policy breaking infrastructure in a way that's invisible until you know where to look.&lt;/p&gt;
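
&lt;p&gt;A minimal Gatekeeper sketch of the label-based exclusion follows. It assumes the &lt;code&gt;K8sContainerLimits&lt;/code&gt; ConstraintTemplate from the community gatekeeper-library is installed; the constraint name and the &lt;code&gt;parameters&lt;/code&gt; ceilings are illustrative. The &lt;code&gt;DoesNotExist&lt;/code&gt; expression makes the constraint apply only to Pods that lack the &lt;code&gt;cnpg.io/cluster&lt;/code&gt; label.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits            # assumed: template from gatekeeper-library
metadata:
  name: require-resource-limits     # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    # Enforce only on Pods that do NOT carry the CNPG cluster label
    labelSelector:
      matchExpressions:
        - key: cnpg.io/cluster
          operator: DoesNotExist
  parameters:
    cpu: "4"        # illustrative maximum CPU limit
    memory: "8Gi"   # illustrative maximum memory limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;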

&lt;h3&gt;
  
  
  2. Give the operator the RBAC it actually needs
&lt;/h3&gt;

&lt;p&gt;If you provision service accounts by hand instead of trusting the operator's defaults, remember that CNPG needs to manage Jobs, Pods, PVCs, Secrets, and Services on your behalf. A read-only or too narrowly scoped account will fail in the same silent way a policy block does: the reconcile loop runs, the create call gets a &lt;code&gt;403&lt;/code&gt;, and nothing visible happens.&lt;/p&gt;

&lt;p&gt;The operator's ClusterRole covers this out of the box. If you're tightening it, the permissions that are easy to overlook are creating and deleting Jobs (for &lt;code&gt;initdb&lt;/code&gt; and restores) and managing PVCs (for volume expansion and replica provisioning). Strip those and the cluster either never bootstraps or comes up and then breaks the first time it needs to scale or recover. I go deeper on scoping accounts like this in &lt;a href="https://dev.to/posts/kubernetes-rbac-building-least-privilege-service-accounts/"&gt;Kubernetes RBAC: Building Least-Privilege Service Accounts&lt;/a&gt;.&lt;/p&gt;
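
&lt;p&gt;For illustration only, the two rule groups called out above might look like the excerpt below. This is not the operator's shipped ClusterRole, which covers many more resources. The role name is hypothetical and the verb lists are an assumption about a reasonable minimum, so diff against the manifest your CNPG release actually installs before tightening anything.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cnpg-lifecycle-excerpt   # hypothetical; excerpt, not a complete role
rules:
  # Bootstrap (initdb) and restore run as Jobs
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete", "patch"]
  # Volume expansion and replica provisioning go through PVCs
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;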

&lt;h3&gt;
  
  
  3. Pin your PostgreSQL minor version away from 16.4
&lt;/h3&gt;

&lt;p&gt;There's a known regression in PostgreSQL 16.4 where the server can hit a segmentation fault under certain memory conditions on nodes with large amounts of RAM available. If you're running CNPG on beefy worker nodes (16GB+ of available memory is the trigger zone), this is exactly the kind of thing that looks like a CNPG bug, a storage bug, or a kernel OOM, when it's actually upstream Postgres.&lt;/p&gt;

&lt;p&gt;The fix is boring and effective: pin the image to a known-good minor and don't float the tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql.cnpg.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cluster&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pg-main&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;databases&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="c1"&gt;# Pin explicitly. Do not use a floating major-version tag in production.&lt;/span&gt;
  &lt;span class="na"&gt;imageName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/cloudnative-pg/postgresql:16.6&lt;/span&gt;
  &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20Gi&lt;/span&gt;
    &lt;span class="na"&gt;storageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;longhorn&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the memory &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;limits&lt;/code&gt; are set to the same value. For a database, you almost never want Postgres evicted or OOM-killed because a noisy neighbor ballooned and the kubelet decided your &lt;code&gt;requests&lt;/code&gt; were a polite suggestion. One caveat: the Guaranteed QoS class requires CPU requests to equal CPU limits too, so the manifest above, with a &lt;code&gt;500m&lt;/code&gt; CPU request and a 1-core limit, lands in Burstable. Raise the CPU request to &lt;code&gt;1&lt;/code&gt; if you want Guaranteed, which is what you want for a stateful workload you can't afford to lose to memory pressure.&lt;/p&gt;
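
&lt;p&gt;A sketch of a &lt;code&gt;resources&lt;/code&gt; stanza that meets the Guaranteed criteria (CPU and memory requests equal to limits on every container) is shown here; the sizes are illustrative, not a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Fragment of the Cluster spec
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"      # equal to the limit, unlike the 500m request above
    limits:
      memory: "2Gi"
      cpu: "1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once an instance is running, &lt;code&gt;kubectl get pod pg-main-1 -n databases -o jsonpath='{.status.qosClass}'&lt;/code&gt; reports the class Kubernetes actually assigned (the Pod name here assumes CNPG's usual &lt;code&gt;cluster-N&lt;/code&gt; naming).&lt;/p&gt;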

&lt;h3&gt;
  
  
  4. Understand the three Services CNPG hands you
&lt;/h3&gt;

&lt;p&gt;This is the part that pays off long after install. For a cluster named &lt;code&gt;pg-main&lt;/code&gt;, CNPG creates a set of Services automatically, and each one routes to a different role:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Routes to&lt;/th&gt;
&lt;th&gt;Use it for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pg-main-rw&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Current primary&lt;/td&gt;
&lt;td&gt;Writes, migrations, anything that mutates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pg-main-ro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Replicas only&lt;/td&gt;
&lt;td&gt;Read-only queries, reporting, analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pg-main-r&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any instance (primary or replica)&lt;/td&gt;
&lt;td&gt;Reads where you don't care which node&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;-rw&lt;/code&gt; Service is the important one: when CNPG fails over, it repoints &lt;code&gt;-rw&lt;/code&gt; at the new primary. Your application doesn't need to know a failover happened. It keeps connecting to &lt;code&gt;pg-main-rw.databases.svc.cluster.local&lt;/code&gt; and the operator handles the rest. That's the entire value proposition of running Postgres under an operator instead of as a hand-rolled StatefulSet.&lt;/p&gt;

&lt;p&gt;For read/write splitting, point your app at two connection strings instead of one. Most ORMs and connection libraries support a primary/replica config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In your app's container spec (PGPASSWORD must already be defined in this env list)&lt;/span&gt;
&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL_PRIMARY&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://app:$(PGPASSWORD)@pg-main-rw.databases.svc.cluster.local:5432/appdb"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL_REPLICA&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://app:$(PGPASSWORD)@pg-main-ro.databases.svc.cluster.local:5432/appdb"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send &lt;code&gt;SELECT&lt;/code&gt;s that tolerate slight replication lag to &lt;code&gt;-ro&lt;/code&gt;, and send everything else to &lt;code&gt;-rw&lt;/code&gt;. The catch worth stating plainly: replicas are asynchronous by default, so a read immediately after a write can return stale data. If you need read-your-writes consistency for a given query, send it to &lt;code&gt;-rw&lt;/code&gt;. Don't blanket-route all reads to replicas and then act surprised when a user doesn't see the row they just created.&lt;/p&gt;
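
&lt;p&gt;One detail the connection-string snippet leaves implicit: Kubernetes only expands &lt;code&gt;$(PGPASSWORD)&lt;/code&gt; if that variable is defined earlier in the same &lt;code&gt;env&lt;/code&gt; list. A hedged sketch, assuming the app connects with the credentials in CNPG's generated &lt;code&gt;pg-main-app&lt;/code&gt; Secret and that the Secret is available in the application's namespace (Secrets can't be referenced across namespaces, so you may need to copy or sync it):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  # Must come first so the $(PGPASSWORD) references below can expand
  - name: PGPASSWORD
    valueFrom:
      secretKeyRef:
        name: pg-main-app   # assumed: CNPG's application-user Secret for pg-main
        key: password
  - name: DATABASE_URL_PRIMARY
    value: "postgresql://app:$(PGPASSWORD)@pg-main-rw.databases.svc.cluster.local:5432/appdb"
  - name: DATABASE_URL_REPLICA
    value: "postgresql://app:$(PGPASSWORD)@pg-main-ro.databases.svc.cluster.local:5432/appdb"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;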

&lt;h3&gt;
  
  
  5. Connection SSL: the untrusted-certificate wall
&lt;/h3&gt;

&lt;p&gt;CNPG enables TLS by default and issues its own certificates through an internal CA. That's good for in-cluster security and annoying the first time a client refuses to connect because it doesn't trust the CA.&lt;/p&gt;

&lt;p&gt;The error you'll see from a client is some flavor of &lt;code&gt;SSL error: certificate verify failed&lt;/code&gt; or &lt;code&gt;self-signed certificate in certificate chain&lt;/code&gt;. The wrong reaction is to globally disable TLS on the cluster. The right reaction depends on who's connecting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In-cluster clients: trust CNPG's CA. The operator publishes it as a Secret.&lt;/span&gt;
kubectl get secret pg-main-ca &lt;span class="nt"&gt;-n&lt;/span&gt; databases &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.ca\.crt}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ca.crt
&lt;span class="c"&gt;# Then point the client at it:&lt;/span&gt;
&lt;span class="c"&gt;# postgresql://...?sslmode=verify-full&amp;amp;sslrootcert=/etc/pg/ca.crt&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
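
&lt;p&gt;For a workload running in the cluster, one possible way to get that CA to the path used in the connection string is to mount the Secret directly. This is a sketch: the container name and image are placeholders, and it assumes the &lt;code&gt;pg-main-ca&lt;/code&gt; Secret exists in the Pod's namespace.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Fragment of a Pod template
spec:
  containers:
    - name: app                               # placeholder
      image: registry.example.com/app:1.0.0   # placeholder
      volumeMounts:
        - name: pg-ca
          mountPath: /etc/pg                  # matches sslrootcert=/etc/pg/ca.crt
          readOnly: true
  volumes:
    - name: pg-ca
      secret:
        secretName: pg-main-ca
        items:
          # Project only the certificate; keep the CA private key out of the Pod
          - key: ca.crt
            path: ca.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;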



&lt;p&gt;For clients that genuinely can't do certificate verification (some managed platforms and serverless backends only support a binary "SSL on/off" toggle and can't be handed a custom CA), you have two honest options. Either set &lt;code&gt;sslmode=require&lt;/code&gt; on the client, which encrypts the connection but skips CA verification, or terminate trust at a proxy you control. &lt;code&gt;sslmode=require&lt;/code&gt; is the pragmatic middle ground: you keep encryption in transit and drop only the identity check. It's not as strong as &lt;code&gt;verify-full&lt;/code&gt;, but it's a deliberate, documented tradeoff rather than turning TLS off entirely.&lt;/p&gt;

&lt;p&gt;Here's the quick reference I keep around for the &lt;code&gt;sslmode&lt;/code&gt; ladder:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;sslmode&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Encrypted?&lt;/th&gt;
&lt;th&gt;Verifies CA?&lt;/th&gt;
&lt;th&gt;Verifies hostname?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;disable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;require&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify-ca&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;verify-full&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Aim for &lt;code&gt;verify-full&lt;/code&gt; for anything in-cluster, where you control the CA distribution. Drop to &lt;code&gt;require&lt;/code&gt; only for external clients that can't be handed the CA, and never to &lt;code&gt;disable&lt;/code&gt;. If you're already running cluster-wide TLS automation, the CA-distribution problem is the same one cert-manager solves for ingress; I covered that workflow in &lt;a href="https://dev.to/posts/cert-manager-cloudflare-dns-01-automated-tls-for-everything/"&gt;cert-manager + Cloudflare DNS-01: Automated TLS for Everything&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Exposing pgAdmin without poking a hole in the cluster
&lt;/h3&gt;

&lt;p&gt;You'll eventually want a GUI to poke at the database. The pattern I'd reach for is pgAdmin4 in its own namespace, reachable through your existing ingress controller, never exposed directly. Keep it in a separate namespace from the database so your network policies can treat it as an external-ish client that's explicitly allowed to reach the &lt;code&gt;-rw&lt;/code&gt;/&lt;code&gt;-ro&lt;/code&gt; Services, rather than something that lives inside the data tier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgadmin&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgadmin&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Force HTTPS and lean on cert-manager for the cert&lt;/span&gt;
    &lt;span class="na"&gt;cert-manager.io/cluster-issuer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;letsencrypt-prod&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/ssl-redirect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="c1"&gt;# pgAdmin needs a bigger body size for imports/exports&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-body-size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16m"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pgadmin.example.com"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgadmin-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgadmin.example.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgadmin&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Put authentication in front of it. pgAdmin's own login is fine, but I'd add an ingress-level auth layer (OAuth proxy or basic auth) so a leaked pgAdmin password isn't a direct line to your database. And lock down the NetworkPolicy so that, apart from your application workloads, only the pgAdmin namespace can reach the database. A database admin GUI on the public internet with default credentials is how clusters become someone else's crypto miner.&lt;/p&gt;
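
&lt;p&gt;A sketch of the pgAdmin allowance, assuming a default-deny ingress policy is already in place in &lt;code&gt;databases&lt;/code&gt; and that the namespaces are named as in this post. The policy name is hypothetical. NetworkPolicies are additive, so this rule only adds pgAdmin; your application, replication traffic between instances, and the operator itself each need their own allow rules.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pgadmin-to-postgres   # hypothetical name
  namespace: databases
spec:
  # Applies to the Pods of the pg-main cluster
  podSelector:
    matchLabels:
      cnpg.io/cluster: pg-main
  policyTypes: ["Ingress"]
  ingress:
    - from:
        # kubernetes.io/metadata.name is set automatically on every namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: pgadmin
      ports:
        - protocol: TCP
          port: 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;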

&lt;h2&gt;
  
  
  Why it works
&lt;/h2&gt;

&lt;p&gt;The thing that finally made CNPG click for me is that it's not pretending Postgres is stateless. It embraces the fact that a database has a primary and replicas, that failover is a real event, and that bootstrapping is a one-time Job rather than a steady-state process. Every piece of the design maps a Postgres concept onto a native Kubernetes object you can inspect with &lt;code&gt;kubectl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's also why the failure modes are sneaky. The operator delegates the actual work to Jobs and Pods, so when an admission controller or RBAC rule blocks one of those, the operator has no good way to surface it beyond a stalled status. There's no exception thrown into your terminal. The reconcile loop is doing exactly what it's designed to do, which is keep trying, and "keep trying against a wall" looks identical to "working" until you go read the Job's events.&lt;/p&gt;

&lt;p&gt;The Service abstraction works because CNPG owns both the failover decision and the endpoint update. When it promotes a replica, it relabels the Pods in the same control loop, so the unchanged &lt;code&gt;-rw&lt;/code&gt; Service selector starts matching the new primary. There's no DNS TTL to wait out, no client-side failover logic to get wrong, no floating VIP to manage. Kubernetes Service routing was already solving "send traffic to whichever Pod currently has this role," and CNPG just plugs the primary/replica roles into that existing machinery. Running databases reliably on Kubernetes is the kind of platform-engineering work that separates a homelab toy from production infrastructure, and it's a chunk of what I do in &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;consulting engagements&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;The biggest shift was learning to debug the resources the operator creates, not the ones it manages. &lt;code&gt;kubectl describe cluster&lt;/code&gt; will lie to you by omission. The Job and its Pod tell the truth. If a CNPG cluster hangs in &lt;code&gt;Setting up primary&lt;/code&gt;, my first move now is straight to the bootstrap Job's events, and nine times out of ten it's a policy or RBAC denial, not a database problem.&lt;/p&gt;

&lt;p&gt;What surprised me was how much the hardened-cluster setup matters. Every CNPG tutorial assumes a permissive cluster, so the exact features that make a cluster production-grade (enforced resource limits, least-privilege RBAC, default-deny network policies) are the features that break the install. None of them are CNPG's fault. They're the cost of doing security right, and the fix is always a narrow, labeled exclusion rather than a blanket exception. If you run CNPG via GitOps, put those policy exclusions in the same ArgoCD app as the operator so they're never out of sync; the &lt;a href="https://dev.to/posts/gitops-for-homelabs-argocd-app-of-apps/"&gt;App-of-Apps pattern&lt;/a&gt; handles this cleanly.&lt;/p&gt;

&lt;p&gt;If I were starting over, I'd pin the PostgreSQL minor version from day one and treat floating tags as a production smell, set Guaranteed QoS on the database Pods before the first incident rather than after, and write the read/write split into the application from the start instead of routing everything at the primary and refactoring later. None of those are hard. They're just the kind of decision that's cheap to make early and expensive to retrofit once you have data and uptime to protect.&lt;/p&gt;

&lt;p&gt;CNPG genuinely delivers on running Postgres in Kubernetes without the pain, but only if you account for the cluster you actually have, not the empty one the docs assume. The operator is excellent. The integration with your security posture is the part you own.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>postgres</category>
      <category>cloudnativepg</category>
      <category>database</category>
    </item>
    <item>
      <title>From Pull Request to Preview Environment: How We Built Disposable Cloud Environments at RTL</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Mon, 15 Jun 2026 21:36:40 +0000</pubDate>
      <link>https://dev.to/rgunasekaran/from-pull-request-to-preview-environment-how-we-built-disposable-cloud-environments-at-rtl-2apj</link>
      <guid>https://dev.to/rgunasekaran/from-pull-request-to-preview-environment-how-we-built-disposable-cloud-environments-at-rtl-2apj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The same developer conversation kept coming up on the platform team: &lt;em&gt;"I can't really verify this Pull Request (PR) without merging it."&lt;/em&gt; Local dev handled simple cases, but the moment a change touched a real cloud resource (a database, a queue, an object store, an external API) the only honest test was to merge and watch.&lt;/p&gt;

&lt;p&gt;This post is how we replaced that with &lt;strong&gt;on-demand ephemeral environments, one per PR&lt;/strong&gt;: fully provisioned from cloud resources down to running Pods, and torn down automatically when the PR closes or merges. It's built on two open-source tools that compose well: &lt;strong&gt;ArgoCD's ApplicationSet PR generator&lt;/strong&gt; and &lt;strong&gt;Crossplane&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why per-PR environments?
&lt;/h2&gt;

&lt;p&gt;Shared "dev" or "staging" environments fail in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Contention&lt;/strong&gt;: two PRs touch the same shared queue or schema, and neither developer can tell whose change broke what.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift&lt;/strong&gt;: what you tested in staging isn't what reaches production, because ten more PRs landed before yours got promoted.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix: give every PR its own slice of the world. Open a PR with a &lt;code&gt;preview&lt;/code&gt; label, a fresh namespace appears, the PR's image is deployed, backing infra is provisioned, a preview URL goes live. Merge or close the PR, and the whole slice disappears.&lt;/p&gt;

&lt;p&gt;That sounds expensive. With Crossplane + ArgoCD it isn't, because everything (manifests &lt;em&gt;and&lt;/em&gt; cloud resources) lives behind the same GitOps loop. No separate Terraform pipeline. ArgoCD watches PRs; Crossplane watches Custom Resources; the rest follows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture, end-to-end
&lt;/h2&gt;

&lt;p&gt;Three zones: Source Code Management (SCM), the Kubernetes cluster, and the cloud. Two signals connect them: the PR generator polling the SCM, and Crossplane authenticating to the cloud via Workload Identity Federation (WIF).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm403ssc1zfi2u9ibrtzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm403ssc1zfi2u9ibrtzr.png" alt=" " width="800" height="1260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every arrow in this diagram is triggered by one event: a PR opening, updating, or closing in GitHub. Start there, and follow the arrows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The building blocks
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ArgoCD ApplicationSet with PR generator&lt;/td&gt;
&lt;td&gt;Watches PRs in source repos. Emits one &lt;code&gt;Application&lt;/code&gt; per matching PR, parameterised by PR number / branch / labels.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm chart (tenant)&lt;/td&gt;
&lt;td&gt;Renders workload manifests for one PR: &lt;code&gt;Deployment&lt;/code&gt;, &lt;code&gt;Service&lt;/code&gt;, &lt;code&gt;Ingress&lt;/code&gt;, plus Crossplane &lt;code&gt;Claim&lt;/code&gt;s for backing infra.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crossplane&lt;/td&gt;
&lt;td&gt;Reconciles Claims into real cloud resources via providers for your cloud.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kyverno&lt;/td&gt;
&lt;td&gt;Admission webhook that stamps every Crossplane Managed Resource at the point of creation, injecting the correct &lt;code&gt;providerConfigRef&lt;/code&gt; and &lt;code&gt;resourceGroupName&lt;/code&gt; from namespace labels. Tenant chart authors never write (or hold) platform credentials.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload Identity Federation&lt;/td&gt;
&lt;td&gt;Lets the Crossplane provider authenticate to the cloud without a static credential. The provider's Kubernetes ServiceAccount is federated to a cloud Managed Identity via the cluster's OIDC issuer URL. No secret, no rotation, no leak surface.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ArgoCD never talks to the cloud directly. It deploys the chart, the chart contains a &lt;code&gt;PostgresClaim&lt;/code&gt;, and Crossplane provisions the database. Both halves, workloads and cloud resources, stay GitOps-pure.&lt;/p&gt;

&lt;h2&gt;
  
  
  ApplicationSet + PR generator
&lt;/h2&gt;

&lt;p&gt;A sanitised tenant ApplicationSet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;tenant-a-previews&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;argocd&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;goTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pullRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant-a&lt;/span&gt;
          &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant-a-app&lt;/span&gt;
          &lt;span class="na"&gt;tokenRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;github-token-tenant-a&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;token&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preview"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;requeueAfterSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tenant-a-pr-{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.number&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant-a&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/platform/tenant-charts.git&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;charts/tenant-a-preview&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
        &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;pr.number&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.number&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;pr.branch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.branch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;https&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;//kubernetes.default.svc&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tenant-a-pr-{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.number&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}'&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;true&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;CreateNamespace=true&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two SCMs, one architecture.&lt;/strong&gt; The PR generator supports GitHub &lt;em&gt;and&lt;/em&gt; Azure DevOps (plus GitLab, Bitbucket). Tenant onboarding values declare &lt;code&gt;scm.provider: github | azuredevops&lt;/code&gt; and the rest is derived.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token storage.&lt;/strong&gt; The SCM token never lives in &lt;code&gt;values.yaml&lt;/code&gt;. We sync a Personal Access Token (PAT) from a cloud Key Vault into the cluster via a vault-sync operator, and the ApplicationSet's &lt;code&gt;tokenRef&lt;/code&gt; points at the resulting Kubernetes Secret.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;requeueAfterSeconds&lt;/code&gt;.&lt;/strong&gt; The default is 30 minutes, which is far too slow for a preview workflow. Drop it to 60 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter on a label, not every PR.&lt;/strong&gt; Tagging the PR &lt;code&gt;preview&lt;/code&gt; keeps the noise down for repos where most PRs don't need a preview.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Crossplane — the cloud resources half
&lt;/h2&gt;

&lt;p&gt;ArgoCD gets the Pod running. The Pod usually wants a database, a queue, an object store. Without Crossplane, your options are: (a) one giant shared DB with per-PR schemas (contention, drift, awful teardown), or (b) a separate Terraform pipeline per PR (slow, asynchronous, two tools to reason about).&lt;/p&gt;

&lt;p&gt;Crossplane gives you a third: &lt;strong&gt;the cloud resource is a Kubernetes object&lt;/strong&gt;. The tenant chart contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db.platform.example.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostgresClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;pr-&lt;/span&gt;&lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.pr.number&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt;-db&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;compositionRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;postgres-small&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;storageGB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;10&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;writeConnectionSecretToRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;db-connection&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A platform-owned &lt;code&gt;Composition&lt;/code&gt; translates that into real cloud resources via Crossplane's provider for your cloud. The connection Secret lands in the PR's namespace; the &lt;code&gt;Deployment&lt;/code&gt; mounts it.&lt;/p&gt;
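
&lt;p&gt;A minimal sketch of the consuming side, assuming the connection Secret is exposed as environment variables rather than mounted as files. The Deployment name, the labels and the registry path are illustrative placeholders, not values from a real tenant chart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata: { name: app }
spec:
  replicas: 1
  selector:
    matchLabels: { app: preview }
  template:
    metadata:
      labels: { app: preview }
    spec:
      containers:
        - name: app
          # Hypothetical registry path; the tag follows the pr-N convention.
          image: "registry.example.io/tenant-a-app:pr-{{ .Values.pr.number }}"
          envFrom:
            # Same name as writeConnectionSecretToRef in the Claim.
            - secretRef: { name: db-connection }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Until Crossplane has written the Secret, the container sits in &lt;code&gt;CreateContainerConfigError&lt;/code&gt; and the kubelet keeps retrying, so this sketch needs no explicit ordering between the Claim and the Deployment.&lt;/p&gt;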

&lt;p&gt;Close the PR → ArgoCD prunes the Application → the Claim disappears → &lt;strong&gt;Crossplane garbage-collects the cloud resource&lt;/strong&gt;. No orphaned databases.&lt;/p&gt;

&lt;p&gt;Habits we picked up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compositions are platform-owned, Claims are tenant-owned.&lt;/strong&gt; Tenants pick from a curated menu (&lt;code&gt;postgres-small&lt;/code&gt;, &lt;code&gt;queue-default&lt;/code&gt;, &lt;code&gt;storage-bucket&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set &lt;code&gt;deletionPolicy: Delete&lt;/code&gt; explicitly.&lt;/strong&gt; Some providers default to &lt;code&gt;Orphan&lt;/code&gt;, which is the wrong default for ephemeral envs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag everything with the PR number&lt;/strong&gt; so every resource in the cloud console is traceable back to the PR that created it.&lt;/li&gt;
&lt;/ul&gt;
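
&lt;p&gt;As a rough illustration of the last two habits, here is what a Managed Resource might look like once a Composition has rendered it. This assumes the Upbound Azure provider's storage &lt;code&gt;Account&lt;/code&gt; kind; the resource name, location, Resource Group and tag keys are all hypothetical, and in a real Composition the PR number would be patched in from the Claim rather than hard-coded:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: storage.azure.upbound.io/v1beta1
kind: Account
metadata:
  name: tenantapr42            # hypothetical name for PR 42
spec:
  # Explicit, so closing the PR removes the cloud resource too.
  deletionPolicy: Delete
  forProvider:
    location: westeurope                     # assumption
    resourceGroupName: rg-tenant-a-previews  # assumption
    accountTier: Standard
    accountReplicationType: LRS
    tags:
      # Traceability back to the PR in the cloud console.
      pr-number: "42"
      tenant: tenant-a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;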

&lt;h2&gt;
  
  
  The lifecycle of a preview PR
&lt;/h2&gt;

&lt;p&gt;Here's what happens when a developer opens a labelled PR, and what gets cleaned up when they close it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open.&lt;/strong&gt; A developer opens a PR and adds the &lt;code&gt;preview&lt;/code&gt; label. The ApplicationSet PR generator polls SCM every 60 seconds, sees the new labelled PR, and emits one &lt;code&gt;Application&lt;/code&gt; (&lt;code&gt;tenant-pr-N&lt;/code&gt;). That Application renders the tenant Helm chart at the PR's exact commit into a fresh namespace, covering both the workload manifests &lt;em&gt;and&lt;/em&gt; the Crossplane Claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provision + serve.&lt;/strong&gt; Crossplane reconciles the Claims into real cloud resources and writes a connection Secret into the namespace. The Pod pulls the &lt;code&gt;pr-N&lt;/code&gt; image, mounts that Secret, and the preview URL goes live. (How each Claim is routed to the right tenant identity and Resource Group is the fence; see &lt;em&gt;Stamping the fence at admission&lt;/em&gt; below.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iterate.&lt;/strong&gt; Every push updates the generated revision; ArgoCD re-syncs, so the preview always tracks the latest commit. No manual redeploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Close.&lt;/strong&gt; Merge or close the PR and it drops off the generator's list on the next 60-second poll. ArgoCD removes the Application and prunes the namespace; the Claim disappears and Crossplane garbage-collects the cloud resource. The whole slice is gone, no orphaned infra to bill for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnl3i11nowm1x26evk6q9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnl3i11nowm1x26evk6q9.png" alt=" " width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The teardown has ordering subtleties (what must drain before what, and what happens when a finalizer gets stuck) but those are operational lessons, so they live in the companion post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The multi-tenant pattern
&lt;/h2&gt;

&lt;p&gt;One &lt;strong&gt;tenant module&lt;/strong&gt; onboards a team in a couple of lines of values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tenants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant-a&lt;/span&gt;
    &lt;span class="na"&gt;scm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github&lt;/span&gt;
      &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;tenant-a&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;tenant-a-app&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;keyvaultKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;tenant-a-pat&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;charts/tenant-a-preview&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;composition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-small&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant-b&lt;/span&gt;
    &lt;span class="na"&gt;scm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azuredevops&lt;/span&gt;
      &lt;span class="na"&gt;azuredevops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;organization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;contoso&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;TenantB&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;tenant-b-app&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;keyvaultKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;tenant-b-pat&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;charts/tenant-b-preview&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;composition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;queue-default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The module renders the per-tenant ApplicationSet, an &lt;code&gt;AppProject&lt;/code&gt; for isolation, the secret wiring, and default RBAC. Adding a new team is a five-line PR. Adoption becomes a conversation about &lt;em&gt;fit&lt;/em&gt;, not &lt;em&gt;effort&lt;/em&gt;.&lt;/p&gt;
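
&lt;p&gt;For a sense of what the isolation piece looks like, here is a minimal sketch of an &lt;code&gt;AppProject&lt;/code&gt; the module could render for &lt;code&gt;tenant-a&lt;/code&gt;. The exact restrictions are an assumption for illustration; the point is that the project only allows the platform chart repo as a source and only the tenant's per-PR namespaces as destinations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata: { name: tenant-a, namespace: argocd }
spec:
  # Only the platform-owned chart repo may be used as a source.
  sourceRepos:
    - https://github.com/platform/tenant-charts.git
  # Applications in this project may only target this tenant's PR namespaces.
  destinations:
    - server: https://kubernetes.default.svc
      namespace: "tenant-a-pr-*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;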

&lt;h2&gt;
  
  
  Stamping the fence at admission
&lt;/h2&gt;

&lt;p&gt;The values block above declares which Resource Group (RG) and which Managed Identity (MI) belong to a tenant. By itself, that fences nothing. A team's Helm chart that writes its own &lt;code&gt;providerConfigRef&lt;/code&gt; could route a &lt;code&gt;PostgresClaim&lt;/code&gt; to another team's identity. A team's chart that picks any &lt;code&gt;resourceGroupName&lt;/code&gt; could pour resources into the wrong RG. &lt;em&gt;Values are not policy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What actually fences the tenant is a three-layer arrangement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD owns the namespace labels.&lt;/strong&gt; Every per-PR namespace is born with &lt;code&gt;platform.example.io/tenant: &amp;lt;team&amp;gt;&lt;/code&gt;, &lt;code&gt;platform.example.io/tenant-rg: &amp;lt;team-rg&amp;gt;&lt;/code&gt;, and &lt;code&gt;platform.example.io/tenant-scope: pr-preview&lt;/code&gt;, set via &lt;code&gt;managedNamespaceMetadata&lt;/code&gt; on the per-tenant ApplicationSet. The team chart cannot change them: ArgoCD reverts any edit on every sync.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kyverno stamps the Managed Resources at admission.&lt;/strong&gt; A &lt;code&gt;ClusterPolicy&lt;/code&gt; matches every Crossplane Managed Resource (MR) created in a &lt;code&gt;pr-preview&lt;/code&gt; namespace and &lt;em&gt;mutates&lt;/em&gt; the spec: it adds &lt;code&gt;providerConfigRef&lt;/code&gt; derived from the &lt;code&gt;tenant&lt;/code&gt; label, and &lt;code&gt;forProvider.resourceGroupName&lt;/code&gt; derived from &lt;code&gt;tenant-rg&lt;/code&gt;. Team chart authors never write either field. They cannot: Kyverno overrides whatever they put there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cloud enforces the blast radius.&lt;/strong&gt; The MI for the team has exactly one &lt;code&gt;RoleAssignment&lt;/code&gt;: &lt;code&gt;Contributor&lt;/code&gt;, scoped to the team's RG. Even if a tenant somehow got past layers 1 and 2, the cloud's own Role-Based Access Control (RBAC) would refuse the call.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kyverno ClusterPolicy (abridged) — runs at admission on every Crossplane MR&lt;/span&gt;
&lt;span class="c1"&gt;# created in a tenant namespace.&lt;/span&gt;
&lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*/*"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;platform.example.io/tenant-scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pr-preview&lt;/span&gt;
&lt;span class="na"&gt;preconditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;all&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;regex_match('^.*&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;.upbound&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;.io$',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;request.kind.group&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}')&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Equals&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;mutate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;patchesJson6902&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;- op: add&lt;/span&gt;
      &lt;span class="s"&gt;path: /spec/providerConfigRef&lt;/span&gt;
      &lt;span class="s"&gt;value:&lt;/span&gt;
        &lt;span class="s"&gt;kind: ClusterProviderConfig&lt;/span&gt;
        &lt;span class="s"&gt;name: '{{ nsLabels."platform.example.io/tenant" }}-azure'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shape &lt;code&gt;kinds: ["*/*"]&lt;/code&gt; is deliberate: Kubernetes admission webhooks reject partial group wildcards, so the group filter lives in &lt;code&gt;preconditions&lt;/code&gt;.&lt;/p&gt;
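
&lt;p&gt;Layer 1 is less visible than the Kyverno policy, so here is a sketch of the relevant fragment of the per-tenant ApplicationSet template. The label keys are the ones listed above; the Resource Group value is a hypothetical placeholder. Note that ArgoCD only applies &lt;code&gt;managedNamespaceMetadata&lt;/code&gt; when &lt;code&gt;CreateNamespace=true&lt;/code&gt; is set:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Fragment of the ApplicationSet, not a complete manifest.
template:
  spec:
    syncPolicy:
      automated: { prune: true, selfHeal: true }
      syncOptions: [CreateNamespace=true]
      managedNamespaceMetadata:
        labels:
          platform.example.io/tenant: tenant-a
          platform.example.io/tenant-rg: rg-tenant-a-previews  # assumption
          platform.example.io/tenant-scope: pr-preview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;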

&lt;p&gt;The pattern inverts the trust model: the &lt;em&gt;chart&lt;/em&gt; is the least-trusted artefact, the &lt;em&gt;namespace label&lt;/em&gt; is the trust anchor, and &lt;em&gt;Kyverno is the seam&lt;/em&gt;. Team chart authors get fast iteration without ever holding platform credentials. Platform owns the fence and can change it in one place when something needs hardening.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication — no static cloud credentials
&lt;/h2&gt;

&lt;p&gt;Crossplane needs to authenticate to the cloud. We refused to put a static credential in a Kubernetes Secret. The pattern that worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Managed Identity&lt;/strong&gt; (or its cloud equivalent) for the Crossplane provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload Identity Federation&lt;/strong&gt;: the ServiceAccount the provider runs as is federated to that identity via the cluster's OpenID Connect (OIDC) issuer URL.&lt;/li&gt;
&lt;li&gt;The provider's &lt;code&gt;ProviderConfig&lt;/code&gt; references the identity by client ID. No secret, no rotation, no leak surface.&lt;/li&gt;
&lt;/ul&gt;
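
&lt;p&gt;On Azure, the third bullet might look roughly like the sketch below. It assumes the Upbound Azure provider and its &lt;code&gt;OIDCTokenFile&lt;/code&gt; credential source; the IDs are placeholders, and the field names should be checked against the provider version in use (the &lt;code&gt;ClusterProviderConfig&lt;/code&gt; kind referenced by the Kyverno policy may use a different API group):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: azure.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  # Matches the tenant-derived name the Kyverno policy injects.
  name: tenant-a-azure
spec:
  credentials:
    # Federated token projected into the provider Pod; no Secret involved.
    source: OIDCTokenFile
  # Placeholder IDs: client ID of the tenant's Managed Identity,
  # plus the Entra tenant and the subscription it lives in.
  clientID: 00000000-0000-0000-0000-000000000000
  tenantID: 00000000-0000-0000-0000-000000000000
  subscriptionID: 00000000-0000-0000-0000-000000000000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;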

&lt;p&gt;Azure: User-Assigned Managed Identity + FederatedIdentityCredential. AWS: IAM Roles for Service Accounts (IRSA). GCP: Workload Identity. Same principle, different flavour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One honest caveat.&lt;/strong&gt; In a multi-tenant control plane like ours, &lt;em&gt;one&lt;/em&gt; Crossplane provider Pod talks to &lt;em&gt;many&lt;/em&gt; tenant identities. Its ServiceAccount mounts a single OIDC token, and the &lt;code&gt;subject&lt;/code&gt; claim in that token is shared across all tenants. Per-tenant isolation comes from the fence above: Kyverno picks the right &lt;code&gt;ProviderConfig&lt;/code&gt; per namespace, and the RG-scoped &lt;code&gt;RoleAssignment&lt;/code&gt; enforces blast radius at the cloud. It is &lt;em&gt;not&lt;/em&gt; cryptographic isolation. We made our peace with the trade-off; worth naming out loud.&lt;/p&gt;

&lt;p&gt;The SCM PAT is the one static credential we still need: the PR generator does not yet federate to SCMs natively. We rotate it on a schedule and hope upstream catches up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The half of the loop that bit us
&lt;/h2&gt;

&lt;p&gt;The flow above assumes the image tagged &lt;code&gt;pr-&amp;lt;N&amp;gt;&lt;/code&gt; &lt;em&gt;exists&lt;/em&gt; when ArgoCD spawns the Application. That image comes from &lt;strong&gt;the tenant's own CI pipeline&lt;/strong&gt;, not the platform.&lt;/p&gt;

&lt;p&gt;We discovered this during demo validation: every platform piece worked beautifully, but the Pod hit &lt;code&gt;ErrImagePull&lt;/code&gt; because nothing had built &lt;code&gt;pr-42&lt;/code&gt;. The fix on the tenant side is small: a PR-triggered pipeline that builds and tags &lt;code&gt;pr-$(PR_NUMBER)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;
&lt;span class="na"&gt;pr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BuildPushImage&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker@2&lt;/span&gt;
            &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;buildAndPush&lt;/span&gt;
              &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$(REGISTRY)/$(IMAGE)&lt;/span&gt;
              &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pr-$(System.PullRequest.PullRequestId)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bigger lesson: &lt;strong&gt;write the platform/tenant contract down before onboarding tenant #2&lt;/strong&gt;. What the platform owns vs. what the tenant owns. Without it, the seam between "platform did its part" and "tenant did its part" becomes a debugging nightmare.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the workaround became the bug
&lt;/h2&gt;

&lt;p&gt;A day after a mitigation landed, a wave of fresh-PR Pods started failing with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AADSTS700211: No matching federated identity record found for presented assertion issuer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl get fic -o yaml&lt;/code&gt; on the affected tenants showed exactly the OIDC issuer URL we expected in &lt;code&gt;spec.forProvider&lt;/code&gt;. So we asked Azure directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;az identity federated-credential show &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--identity-name&lt;/span&gt; mi-tenant-a &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; mi-tenant-a-fed &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resource-group&lt;/span&gt; rg-tenant-a &lt;span class="se"&gt;\&lt;/span&gt;
  | jq .issuer
&lt;span class="s2"&gt;"https://oidc.&amp;lt;old-cluster&amp;gt;.../..."&lt;/span&gt;   &lt;span class="c"&gt;# not what the spec said&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Azure was still serving the pre-migration issuer. The cluster had just migrated to a new OIDC issuer URL. The corrected URL was already in the values file and ArgoCD had synced the Composition, yet the corrected spec had never reached Azure.&lt;/p&gt;

&lt;p&gt;To understand why, you have to know about the workaround already living in the FIC's Composition. A day earlier we had hit a different bug: legitimate FIC spec changes were being overwritten back to stale values within seconds of applying them. The symptom looked exactly like a stale-state cache in the upstream provider. The mitigation we shipped was minimal: strip &lt;code&gt;Update&lt;/code&gt; from &lt;code&gt;spec.managementPolicies&lt;/code&gt;, leaving &lt;code&gt;[Create, Delete, Observe]&lt;/code&gt;. Crossplane could still create and observe FICs but could no longer push corrections. The symptom went away. We marked it "interim, pending upstream fix" in the docs and moved on.&lt;/p&gt;

&lt;p&gt;What we did not realise: we had just disabled the only path through which &lt;em&gt;legitimate future spec changes&lt;/em&gt; could reach Azure. The OIDC issuer correction sat in git, reconciled, observed, and stranded.&lt;/p&gt;

&lt;p&gt;The real fix landed in three commits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Find the second writer.&lt;/strong&gt; The earlier "stale writes" symptom was not a cache. Crossplane had been running on our previous cluster (call it cluster-A) while we migrated the platform to a new cluster (cluster-B). We updated the OIDC issuer URL in cluster-B's values file and replaced all cluster-A references, but forgot to disable Crossplane on cluster-A in the app-of-apps. Both clusters held valid Workload Identity Federation credentials scoped to the same Managed Identities. Both were reconciling the same Azure Federated Identity Credentials. When cluster-B wrote the corrected OIDC issuer URL, cluster-A's Crossplane saw drift from its own (stale) spec and immediately wrote it back. The two providers were racing over every field, continuously. Disabling Crossplane on cluster-A removed the second writer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Put &lt;code&gt;Update&lt;/code&gt; back.&lt;/strong&gt; With the real cause neutralised, restoring &lt;code&gt;Update&lt;/code&gt; to &lt;code&gt;managementPolicies&lt;/code&gt; (&lt;code&gt;Create, Update, Delete, Observe&lt;/code&gt;) is safe again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recreate the affected composites.&lt;/strong&gt; Existing FICs in Azure still carried the wrong pre-migration issuer. &lt;code&gt;kubectl delete xteamidentityazure &amp;lt;tenant&amp;gt;&lt;/code&gt; for each affected tenant; ArgoCD re-renders the claim; the Composition re-emits FICs with the corrected spec; &lt;code&gt;UserAssignedIdentity&lt;/code&gt; adopts the existing Azure MI via pinned &lt;code&gt;external-name&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
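
&lt;p&gt;The interaction between the second writer and the stripped &lt;code&gt;Update&lt;/code&gt; policy can be reproduced in a toy model. The sketch below is not Crossplane code: &lt;code&gt;ToyCloud&lt;/code&gt;, &lt;code&gt;ToyReconciler&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt; are hypothetical names, and the only behaviour it assumes is that a reconciler without &lt;code&gt;Update&lt;/code&gt; observes drift but never corrects it.&lt;/p&gt;

```python
# Toy model only: none of these names exist in Crossplane.
ALL_POLICIES = frozenset({"Create", "Update", "Delete", "Observe"})


class ToyCloud:
    """Stands in for the cloud-side record of one federated credential."""

    def __init__(self, issuer):
        self.issuer = issuer


class ToyReconciler:
    """One cluster's controller: compares its own spec to the cloud."""

    def __init__(self, name, desired_issuer, policies=ALL_POLICIES):
        self.name = name
        self.desired_issuer = desired_issuer
        self.policies = policies

    def reconcile(self, cloud):
        """Return True if this pass wrote to the cloud."""
        if cloud.issuer == self.desired_issuer:
            return False  # observed, no drift
        if "Update" not in self.policies:
            return False  # drift observed, correction never pushed
        cloud.issuer = self.desired_issuer
        return True


def run(cloud, reconcilers, rounds):
    """Run the reconcilers in turn and count the writes each one made."""
    writes = {r.name: 0 for r in reconcilers}
    for _ in range(rounds):
        for r in reconcilers:
            writes[r.name] += int(r.reconcile(cloud))
    return writes
```

&lt;p&gt;With one reconciler pinned to the old issuer and one to the new, every round produces a write from each side and the cloud ends on whichever ran last. Removing &lt;code&gt;Update&lt;/code&gt; stops the writes but leaves the old issuer in place, which is the stranded state described above. Only dropping the second writer lets a single corrective write stick.&lt;/p&gt;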

&lt;p&gt;Three lessons we will not forget:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A mitigation that disables a reconcile loop is a time bomb for legitimate change.&lt;/strong&gt; Stripping &lt;code&gt;Update&lt;/code&gt; made the original symptom go away &lt;em&gt;and&lt;/em&gt; blocked the very next legitimate spec change from reaching the cloud, within 24 hours. Workarounds shipped mid-migration are especially dangerous: that's exactly when legitimate spec changes are flying past, any one of which the mitigation could silently swallow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When migrating Crossplane to a new cluster, disable it on the old one first.&lt;/strong&gt; Two Crossplane providers authenticated to the same cloud identity will race over the same cloud resources. The symptom looks like a stale-state cache or a misbehaving provider, but it isn't. If both clusters hold valid Workload Identity Federation credentials scoped to the same Managed Identity, both reconcilers will fight over every field on every Azure resource. Disable (or fully remove) Crossplane on the source cluster before the destination cluster takes over. App-of-apps templating makes this easy to miss: the old cluster's entry is still there, still healthy, still reconciling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When the Kubernetes side and the cloud side disagree, ask the cloud.&lt;/strong&gt; &lt;code&gt;kubectl get fic&lt;/code&gt; agreed with the corrected spec. Azure did not. &lt;code&gt;az identity federated-credential show&lt;/code&gt; was the call that broke the illusion. If the symptom screams &lt;em&gt;trust&lt;/em&gt;, don't trust what Kubernetes tells you about the cloud. Go to the cloud.&lt;/li&gt;
&lt;/ul&gt;
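
&lt;p&gt;The "ask the cloud" check is easy to automate. The following is a minimal sketch, assuming the &lt;code&gt;az&lt;/code&gt; CLI is installed and logged in. It wraps the same &lt;code&gt;az identity federated-credential show&lt;/code&gt; call used earlier and compares its &lt;code&gt;issuer&lt;/code&gt; field with the value from the Kubernetes spec. The function names are hypothetical, and how the spec-side issuer is obtained is left to the caller.&lt;/p&gt;

```python
import json
import subprocess


def show_federated_credential(identity_name, name, resource_group):
    """Run the az command from the text and return its raw JSON output."""
    result = subprocess.run(
        [
            "az", "identity", "federated-credential", "show",
            "--identity-name", identity_name,
            "--name", name,
            "--resource-group", resource_group,
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def issuer_drift(spec_issuer, cloud_json):
    """Compare the issuer in the Kubernetes spec with what the cloud reports.

    Returns None when both sides agree, else a (spec, cloud) tuple.
    """
    cloud_issuer = json.loads(cloud_json).get("issuer")
    if cloud_issuer == spec_issuer:
        return None
    return (spec_issuer, cloud_issuer)
```

&lt;p&gt;Feeding the output of &lt;code&gt;show_federated_credential("mi-tenant-a", "mi-tenant-a-fed", "rg-tenant-a")&lt;/code&gt; into &lt;code&gt;issuer_drift&lt;/code&gt; would have flagged the stale issuer without anyone having to suspect it first.&lt;/p&gt;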

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pick your secret-injection operator deliberately.&lt;/strong&gt; We migrated from the External Secrets Operator (ESO) to a cloud-native vault-sync operator (akv2k8s on Azure). The motivation was workload-identity hygiene: ESO needed its own cluster ServiceAccount with cloud-side credentials, while the cloud-native operator authenticates with the cluster's existing federated identity. One fewer credential, one fewer rotation. Either pattern works, but pick early; mid-stream migration is annoying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tear-down ordering matters.&lt;/strong&gt; Set &lt;code&gt;PrunePropagationPolicy=foreground&lt;/code&gt; on the per-tenant Application's &lt;code&gt;syncOptions&lt;/code&gt;. That tells the Kubernetes garbage collector to wait for Crossplane finalizers to drain before deleting the namespace. Get this wrong and the namespace disappears first, orphaning any cloud resource whose Claim was still finalizing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cap the cost.&lt;/strong&gt; Set a TTL: auto-close PRs idle for N days, or scale preview Pods to zero overnight. Ephemeral envs are cheap individually and expensive in aggregate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate end-to-end, on a real tenant repo, by opening a PR yourself.&lt;/strong&gt; "Platform delivered" is not the same as "the loop closes." The PR is the only acceptance test that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A live demo on a tenant's actual codebase converts more skeptics than any docs page.&lt;/strong&gt; Build one, run it, click the preview URL together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One public feedback channel&lt;/strong&gt; (Slack thread or Confluence page) where every team's pain lives. Triage weekly. The platform stays honest.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Federated SCM auth&lt;/strong&gt;: kill the last static credential when upstream catches up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-PR cost estimate&lt;/strong&gt; posted back to the PR as a comment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test on spin-up&lt;/strong&gt;: &lt;code&gt;is the preview URL 200?&lt;/code&gt; posted back as a PR check. Half the time when a preview "doesn't work" it's the tenant's app, not the platform; making that visible saves the back-and-forth.&lt;/li&gt;
&lt;/ul&gt;
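
&lt;p&gt;The smoke test in the last item could start as small as the sketch below. It uses only the Python standard library. The payload returned by &lt;code&gt;preview_check&lt;/code&gt; is a hypothetical shape rather than any SCM's actual check API, and posting it back to the PR is left out.&lt;/p&gt;

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout


def preview_check(url, fetch=fetch_status):
    """Build a pass/fail payload for a PR check (payload shape is hypothetical)."""
    status = fetch(url)
    ok = status == 200
    summary = f"preview URL returned {status}"
    return {"conclusion": "success" if ok else "failure", "summary": summary}
```

&lt;p&gt;The &lt;code&gt;fetch&lt;/code&gt; parameter exists so the check can be exercised without a live preview environment. In the pipeline it would run with the default.&lt;/p&gt;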

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;The best part of this setup isn't the technology: it's that "did this PR work?" stops being a debate. The developer opens the PR, clicks the preview link, exercises the change end-to-end against real cloud resources, and either ships it or doesn't.&lt;/p&gt;

&lt;p&gt;If you're standing this up at your own org, start with &lt;strong&gt;one&lt;/strong&gt; tenant repo, &lt;strong&gt;one&lt;/strong&gt; type of backing resource, and a &lt;strong&gt;labelled PR&lt;/strong&gt; as the trigger. Get the loop closing end-to-end before you generalise. Everything else is iteration.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>argocd</category>
      <category>crossplane</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>The Ultimate Kubernetes Checklist: Unlocking Performance While Slashing Costs</title>
      <dc:creator>Nandini</dc:creator>
      <pubDate>Mon, 15 Jun 2026 16:57:38 +0000</pubDate>
      <link>https://dev.to/nandini_8kanaujiya/the-ultimate-kubernetes-checklist-unlocking-performance-while-slashing-costs-4cm7</link>
      <guid>https://dev.to/nandini_8kanaujiya/the-ultimate-kubernetes-checklist-unlocking-performance-while-slashing-costs-4cm7</guid>
      <description>&lt;ol&gt;
&lt;li&gt;Build a Strong Kubernetes Foundation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Before optimizing costs or performance, organizations must ensure their Kubernetes environment is healthy and reliable. A strong foundation reduces operational risks and creates a stable platform for future growth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Areas to Review&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control plane health monitoring&lt;/li&gt;
&lt;li&gt;Kubernetes version management&lt;/li&gt;
&lt;li&gt;Automated backup strategies&lt;/li&gt;
&lt;li&gt;Multi-zone deployments&lt;/li&gt;
&lt;li&gt;Node health monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many performance issues and unexpected outages originate from poor cluster maintenance. Establishing strong operational practices early prevents larger problems later and improves overall cluster reliability.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Optimize Resources and Scale Efficiently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Resource waste is one of the biggest contributors to rising Kubernetes costs. Overprovisioned CPU and memory allocations often leave clusters running far below their actual capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource Optimization Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review CPU and memory requests&lt;/li&gt;
&lt;li&gt;Compare allocated resources with actual usage&lt;/li&gt;
&lt;li&gt;Remove unused workloads&lt;/li&gt;
&lt;li&gt;Optimize scheduled batch jobs&lt;/li&gt;
&lt;li&gt;Right-size applications regularly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Smart Scaling Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure Horizontal Pod Autoscaler (HPA)&lt;/li&gt;
&lt;li&gt;Enable Cluster Autoscaler&lt;/li&gt;
&lt;li&gt;Test scaling thresholds&lt;/li&gt;
&lt;li&gt;Simulate peak traffic conditions&lt;/li&gt;
&lt;li&gt;Monitor node utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organizations frequently reduce cloud spending by 20–40% simply by right-sizing workloads and implementing effective autoscaling strategies.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Improve Visibility, Security, and Operational Control&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Optimization becomes difficult when teams lack visibility into resource consumption, application performance, and infrastructure costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable metrics collection&lt;/li&gt;
&lt;li&gt;Configure centralized logging&lt;/li&gt;
&lt;li&gt;Implement distributed tracing&lt;/li&gt;
&lt;li&gt;Establish cost monitoring&lt;/li&gt;
&lt;li&gt;Create performance dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enforce RBAC policies&lt;/li&gt;
&lt;li&gt;Configure Network Policies&lt;/li&gt;
&lt;li&gt;Scan container images&lt;/li&gt;
&lt;li&gt;Secure secrets management&lt;/li&gt;
&lt;li&gt;Enable audit logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Visibility helps teams identify inefficiencies, while security ensures workloads remain protected without sacrificing performance.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Create a Culture of Continuous Optimization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Kubernetes environments constantly evolve. New applications, changing traffic patterns, and increasing infrastructure demands can quickly introduce inefficiencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous Improvement Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly cost audits&lt;/li&gt;
&lt;li&gt;Quarterly architecture reviews&lt;/li&gt;
&lt;li&gt;Resource utilization analysis&lt;/li&gt;
&lt;li&gt;Performance benchmarking&lt;/li&gt;
&lt;li&gt;FinOps collaboration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Maturity Scorecard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Evaluate your environment regularly across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure Health&lt;/li&gt;
&lt;li&gt;Resource Efficiency&lt;/li&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Cost Governance&lt;/li&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Final Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most successful Kubernetes teams don't optimize once—they optimize continuously. By regularly reviewing costs, performance, scalability, and security, organizations can build Kubernetes platforms that are both high-performing and cost-efficient.&lt;/p&gt;

&lt;p&gt;Remember: Kubernetes success isn't about running more containers. It's about running them smarter.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloudcomputing</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations</title>
      <dc:creator>Nijo George Payyappilly</dc:creator>
      <pubDate>Mon, 15 Jun 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/npayyappilly/the-human-in-the-loop-sre-designing-automation-escalation-policies-for-ai-assisted-operations-2c7f</link>
      <guid>https://dev.to/npayyappilly/the-human-in-the-loop-sre-designing-automation-escalation-policies-for-ai-assisted-operations-2c7f</guid>
      <description>&lt;p&gt;On June 8, 2021, a Fastly CDN configuration change triggered a global outage that took down the UK government website, the New York Times, Reddit, and hundreds of other major internet properties for approximately one hour. The triggering event was a configuration push. The propagation mechanism was automated. The time between the configuration being pushed and the global impact becoming visible was under a minute. The time required for a human operator to identify the cause and initiate the rollback was approximately forty-nine minutes longer than that.&lt;/p&gt;

&lt;p&gt;The Fastly incident is not primarily a story about automation failure. It is a story about the speed asymmetry between automated propagation and human response — and about what happens when the automation layer between a human decision and its production consequence moves faster than the accountability layer designed to govern it.&lt;/p&gt;

&lt;p&gt;This asymmetry is the defining operational challenge of AI-assisted SRE. The capability to automate incident detection, root cause hypothesis generation, and even remediation is now accessible at costs and latencies that were unavailable five years ago. The operational risk is not that this capability will be under-used. The risk is that it will be deployed without a rigorous escalation policy — a formal framework that defines exactly where automated execution ends and human judgement begins, under what conditions the boundary shifts, and how accountability is preserved for every action the AI takes on behalf of an operator who may not have been in the room when it was taken.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Human-in-the-Loop Spectrum
&lt;/h2&gt;

&lt;p&gt;AI-assisted SRE operations do not exist at a single point on the autonomy spectrum. They exist across a range, and the appropriate position on that range is a function of confidence, blast radius, novelty, and regulatory context — not of how sophisticated the AI system is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;THE AUTOMATION AUTONOMY SPECTRUM
────────────────────────────────────────────────────────────────────────────

LEVEL 0 — MANUAL
  AI generates no recommendations. Human observes raw telemetry and decides.
  Appropriate when: AI system is unavailable, untrusted, or context is
  outside AI training distribution entirely.

LEVEL 1 — ASSISTED
  AI surfaces relevant context, correlated signals, and historical patterns.
  Human makes all decisions. AI does not recommend actions.
  Appropriate when: novel failure pattern; first occurrence of incident type;
  regulated change requiring documented human judgement.

LEVEL 2 — SUPERVISED
  AI recommends specific actions with confidence scores. Human approves
  each action before execution. AI does not execute autonomously.
  Appropriate when: high blast radius; unfamiliar but not novel pattern;
  action is reversible but consequential.

LEVEL 3 — CONDITIONAL AUTONOMOUS
  AI executes actions autonomously within pre-approved policy boundaries.
  Human is notified after execution. Human can abort within a defined window.
  Appropriate when: well-characterised failure pattern; low blast radius;
  action is fully reversible; pattern seen &amp;gt; N times with consistent outcome.

LEVEL 4 — AUTONOMOUS
  AI executes and verifies remediation without human notification unless
  verification fails. Audit trail maintained.
  Appropriate when: toil pattern fully characterised; action is idempotent;
  blast radius is bounded to a single service; recurrence rate justifies
  zero-latency response.

────────────────────────────────────────────────────────────────────────────
CRITICAL CONSTRAINT: No action may exist permanently at Level 4.
Every Level 4 automation must have a scheduled re-qualification review
that reassesses whether the failure pattern is still well-characterised
and the blast radius assumption still holds.
────────────────────────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical constraint — that no action may exist permanently at Level 4 — is not conservatism. It is the engineering response to a specific failure mode: automation that was correctly calibrated at deployment time and has silently drifted out of calibration as the system evolved. An OOM restart automation that was safe when first deployed becomes unsafe the moment the underlying cause shifts from a memory leak to a data corruption event that is triggering the same symptom. The re-qualification review is the mechanism that catches this drift before it produces an incident.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Escalation Triggers
&lt;/h2&gt;

&lt;p&gt;Every escalation policy is built from four primitive triggers. Each trigger defines a condition under which the automation level must shift upward — toward more human involvement, not less.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger 1 — Confidence Threshold Breach
&lt;/h3&gt;

&lt;p&gt;The AI system's confidence in its diagnosis or recommended action has fallen below a defined threshold. In the context of LLM-based operations (HolmesGPT, LiteLLM Proxy routing), confidence is expressed as a combination of model-reported token probability distributions and domain-specific heuristics applied to the recommendation output.&lt;/p&gt;

&lt;p&gt;A low-confidence diagnosis means the AI has identified a plausible pattern match but lacks sufficient corroborating signal to recommend action without human review. Executing actions based on low-confidence diagnoses is the operational equivalent of acting on a single data point in a monitoring dashboard: occasionally correct, reliably dangerous as a policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger 2 — Blast Radius Threshold
&lt;/h3&gt;

&lt;p&gt;The proposed action affects more infrastructure than the policy authorises for autonomous execution. Blast radius is assessed across three dimensions: service count (how many services are affected), traffic fraction (what percentage of user requests are served by the affected infrastructure), and reversibility (can the action be undone in under five minutes with a single command).&lt;/p&gt;

&lt;p&gt;High blast radius is not a disqualifying condition for automation. It is a condition that requires the automation level to shift to at least Level 2 (supervised) regardless of confidence score.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger 3 — Novelty Detection
&lt;/h3&gt;

&lt;p&gt;The failure pattern does not match any pattern in the AI system's training corpus or historical incident database. Novelty is the most dangerous condition for autonomous execution because it is precisely the condition where the AI's pattern-matching capability provides the least value — and where a confident-sounding but incorrect recommendation carries the highest operational cost.&lt;/p&gt;

&lt;p&gt;Novelty detection is the hardest trigger to implement well, because it requires the AI system to accurately assess the boundaries of its own knowledge. A system that cannot reliably distinguish "I have seen this pattern and am confident" from "I have seen a superficially similar pattern and am extrapolating" should not be operating at Level 3 or Level 4.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger 4 — Regulatory Boundary
&lt;/h3&gt;

&lt;p&gt;The proposed action would touch a regulated asset, require a documented change record, affect a system subject to NERC CIP, PCI-DSS, HIPAA, or equivalent obligations, or generate a compliance event. In regulated environments, no automated action may bypass the change management governance framework, regardless of confidence score or blast radius.&lt;/p&gt;

&lt;p&gt;This trigger is absolute. It does not have a confidence threshold exception. An AI system that correctly diagnoses a production issue with 99% confidence and proposes a remediation that would constitute an undocumented change to a regulated asset must hand the decision to a human (Level 1 in the policy document below) and generate a change record, even if the remediation would restore service faster without it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Designing the Escalation Policy Document
&lt;/h2&gt;

&lt;p&gt;The escalation policy is an operational governance document, not a configuration file. It must be version-controlled, reviewed and approved by SRE leadership and compliance, and referenced in every AI-assisted automation's runtime configuration. Its authority derives from human review, not from the AI system that consults it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ESCALATION POLICY: AI-ASSISTED INCIDENT RESPONSE
────────────────────────────────────────────────────────────────────────────
Service:       production-platform (all services)
AI System:     HolmesGPT + LiteLLM Proxy + Ollama / GitHub Models
Policy Version: v1.3  |  Approved: SRE Lead + VP Engineering
Last Reviewed: 2025-Q1  |  Next Review: 2025-Q2
────────────────────────────────────────────────────────────────────────────

SECTION 1: AUTONOMOUS EXECUTION AUTHORISED (Level 4)
  Conditions required (ALL must be true):
    ✓ Confidence score ≥ 0.85 (model-reported + heuristic composite)
    ✓ Pattern seen ≥ 10 times in incident history with consistent outcome
    ✓ Blast radius: single service, single namespace, ≤ 20% of replicas
    ✓ Action is idempotent and fully reversible in ≤ 5 minutes
    ✓ No regulated asset in scope
    ✓ Error budget &amp;gt; 25% remaining (not in Tier 3 freeze)
  Authorised actions at Level 4:
    → Rolling restart of single stateless deployment (OOM, deadlock)
    → Scale-up of single HPA-managed deployment by ≤ 2 replicas
    → Certificate rotation on non-production workloads
    → Log pipeline gateway restart (telemetry outage, no production impact)
  Required logging: structured Splunk event per action (mandatory)
  Re-qualification: every 90 days or after any incident where autonomous
                   action was taken and outcome was suboptimal

SECTION 2: SUPERVISED EXECUTION (Level 2 — Human Approval Required)
  Conditions triggering Level 2 (ANY is sufficient):
    ⚠ Confidence score 0.60–0.84
    ⚠ Blast radius: &amp;gt; 20% of replicas OR &amp;gt; 1 service OR cross-namespace
    ⚠ First or second occurrence of this failure pattern
    ⚠ Error budget between 25–75% (Tier 2 degraded)
    ⚠ Action affects shared infrastructure (Argo CD, Prometheus, Istio)
  Approval mechanism: Slack approval button with 10-minute timeout
  Timeout behaviour: escalate to on-call if no response in 10 minutes
  Required logging: recommendation + approval/rejection + outcome

SECTION 3: ASSISTED ONLY (Level 1 — No Action Authorised)
  Conditions triggering Level 1 (ANY is sufficient):
    ✗ Confidence score &amp;lt; 0.60
    ✗ Novel failure pattern (no match in incident history)
    ✗ Regulated asset in scope (NERC CIP, PCI-DSS, HIPAA boundary)
    ✗ Error budget &amp;lt; 25% (Tier 3 freeze — deployment freeze active)
    ✗ Active P0 incident in progress (human incident commander owns scope)
    ✗ Multiple simultaneous incidents (blast radius assessment unreliable)
  AI role at Level 1: surface correlated signals, historical context only
  Human owns: diagnosis, action decision, execution, verification

SECTION 4: ACCOUNTABILITY CHAIN
  Every AI-assisted action must trace to one of:
    a) Direct human approval (Level 2 Slack approval button)
    b) This policy document (Level 4 autonomous execution)
  "The AI decided" is not a complete accountability chain.
  Policy document owner: SRE Lead
  Policy review and approval authority: SRE Lead + VP Engineering
────────────────────────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
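
&lt;p&gt;Read as code, the policy document reduces to "every trigger caps the autonomy level, and the lowest cap wins". The sketch below is one possible reading rather than HolmesGPT's implementation. &lt;code&gt;Proposal&lt;/code&gt;, &lt;code&gt;band&lt;/code&gt; and &lt;code&gt;required_level&lt;/code&gt; are hypothetical names. It assumes a healthy "tier 1" error budget exists alongside the Tier 2 and Tier 3 states named above, that patterns seen three to nine times stay supervised, and that actions outside the Level 4 allowlist need approval. Level 3 is not modelled because the policy document has no section for it.&lt;/p&gt;

```python
from bisect import bisect_right
from dataclasses import dataclass

ASSISTED, SUPERVISED, AUTONOMOUS = 1, 2, 4
LEVELS = [ASSISTED, SUPERVISED, AUTONOMOUS]


def band(value, lower_bounds, levels):
    """Pick a level by band; each bound is the inclusive start of the next band."""
    return levels[bisect_right(lower_bounds, value)]


@dataclass
class Proposal:
    confidence: float         # composite score, 0.0 to 1.0
    replica_percent: int      # share of replicas the action touches
    service_count: int
    cross_namespace: bool
    occurrences: int          # prior matches in incident history
    regulated: bool
    budget_tier: int          # 1 healthy (assumed), 2 degraded, 3 freeze
    active_p0: bool = False
    allowlisted: bool = True  # action is on the Level 4 authorised list


def required_level(p):
    """Return the most autonomous level the policy permits for this proposal."""
    caps = [
        band(p.confidence, [0.60, 0.85], LEVELS),
        band(p.occurrences, [1, 10], LEVELS),
        band(p.replica_percent, [21], [AUTONOMOUS, SUPERVISED]),
        band(p.service_count, [2], [AUTONOMOUS, SUPERVISED]),
        {1: AUTONOMOUS, 2: SUPERVISED, 3: ASSISTED}[p.budget_tier],
        SUPERVISED if p.cross_namespace else AUTONOMOUS,
        ASSISTED if p.regulated or p.active_p0 else AUTONOMOUS,
        AUTONOMOUS if p.allowlisted else SUPERVISED,
    ]
    return min(caps)  # the most restrictive trigger wins
```

&lt;p&gt;Taking the minimum keeps the regulatory trigger absolute: a proposal with 0.99 confidence on a regulated asset still resolves to Level 1, whatever the other inputs say.&lt;/p&gt;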






&lt;h2&gt;
  
  
  HolmesGPT Escalation Architecture
&lt;/h2&gt;

&lt;p&gt;The escalation policy document defines the governance rules. The escalation architecture implements those rules as runtime logic in the AI-assisted operations stack. The architecture shown here is specific to the HolmesGPT + LiteLLM Proxy + Ollama deployment pattern in a regulated on-premises environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l95eyaya6urklfl6yv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l95eyaya6urklfl6yv4.png" alt="HolmesGPT Escalation Architecture"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# HolmesGPT Escalation Policy ConfigMap&lt;/span&gt;
&lt;span class="c1"&gt;# Consumed by HolmesGPT at runtime to determine autonomy level per action&lt;/span&gt;
&lt;span class="c1"&gt;# Version-controlled in git; updated only via Argo CD sync (change record enforced)&lt;/span&gt;

&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt-escalation-policy&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;sre.internal/policy-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1.3"&lt;/span&gt;
    &lt;span class="na"&gt;sre.internal/approved-by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sre-lead,vp-engineering"&lt;/span&gt;
    &lt;span class="na"&gt;sre.internal/approved-date&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-15"&lt;/span&gt;
    &lt;span class="na"&gt;sre.internal/next-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-06-15"&lt;/span&gt;
    &lt;span class="na"&gt;sre.internal/review-enforced-by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kyverno-policy/ai-ops-policy-review"&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;escalation_policy.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;confidence_thresholds:&lt;/span&gt;
      &lt;span class="s"&gt;autonomous:   0.85&lt;/span&gt;
      &lt;span class="s"&gt;supervised:   0.60&lt;/span&gt;
      &lt;span class="s"&gt;assisted_only: 0.0&lt;/span&gt;

    &lt;span class="s"&gt;blast_radius_limits:&lt;/span&gt;
      &lt;span class="s"&gt;autonomous:&lt;/span&gt;
        &lt;span class="s"&gt;max_replica_fraction: 0.20&lt;/span&gt;
        &lt;span class="s"&gt;max_service_count: 1&lt;/span&gt;
        &lt;span class="s"&gt;max_namespace_count: 1&lt;/span&gt;
        &lt;span class="s"&gt;cross_namespace_allowed: false&lt;/span&gt;
        &lt;span class="s"&gt;regulated_assets_allowed: false&lt;/span&gt;

    &lt;span class="s"&gt;autonomous_actions_allowlist:&lt;/span&gt;
      &lt;span class="s"&gt;- action: rolling_restart_stateless&lt;/span&gt;
        &lt;span class="s"&gt;max_replicas_affected: 5&lt;/span&gt;
        &lt;span class="s"&gt;requires_pdb_check: true&lt;/span&gt;
      &lt;span class="s"&gt;- action: hpa_scale_up&lt;/span&gt;
        &lt;span class="s"&gt;max_replica_delta: 2&lt;/span&gt;
        &lt;span class="s"&gt;requires_current_below_sot: true&lt;/span&gt;
      &lt;span class="s"&gt;- action: log_pipeline_restart&lt;/span&gt;
        &lt;span class="s"&gt;namespaces: [monitoring, sre-platform]&lt;/span&gt;
        &lt;span class="s"&gt;production_namespaces_blocked: true&lt;/span&gt;

    &lt;span class="s"&gt;error_budget_gates:&lt;/span&gt;
      &lt;span class="s"&gt;tier_3_freeze_blocks_autonomous: true&lt;/span&gt;
      &lt;span class="s"&gt;tier_2_degrades_to_supervised: true&lt;/span&gt;

    &lt;span class="s"&gt;regulatory_boundary:&lt;/span&gt;
      &lt;span class="s"&gt;always_level_1_namespaces:&lt;/span&gt;
        &lt;span class="s"&gt;- pci-zone&lt;/span&gt;
        &lt;span class="s"&gt;- hipaa-zone&lt;/span&gt;
        &lt;span class="s"&gt;- nerc-cip-zone&lt;/span&gt;
      &lt;span class="s"&gt;always_level_1_labels:&lt;/span&gt;
        &lt;span class="s"&gt;- "compliance.internal/regulated=true"&lt;/span&gt;

    &lt;span class="s"&gt;novelty_detection:&lt;/span&gt;
      &lt;span class="s"&gt;min_historical_occurrences_for_autonomous: 10&lt;/span&gt;
      &lt;span class="s"&gt;similarity_threshold: 0.80&lt;/span&gt;
      &lt;span class="s"&gt;unknown_pattern_forces_level_1: true&lt;/span&gt;

    &lt;span class="s"&gt;approval_workflow:&lt;/span&gt;
      &lt;span class="s"&gt;slack_channel: "sre-aiops-approvals"&lt;/span&gt;
      &lt;span class="s"&gt;timeout_minutes: 10&lt;/span&gt;
      &lt;span class="s"&gt;timeout_action: escalate_to_oncall&lt;/span&gt;

    &lt;span class="s"&gt;audit:&lt;/span&gt;
      &lt;span class="s"&gt;splunk_sourcetype: "sre:holmesgpt:decisions"&lt;/span&gt;
      &lt;span class="s"&gt;log_all_recommendations: true&lt;/span&gt;
      &lt;span class="s"&gt;log_operator_overrides: true&lt;/span&gt;
      &lt;span class="s"&gt;override_feeds_prompt_review: true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
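&lt;p&gt;To make the policy's gating order concrete, here is a minimal Python sketch of how an enforcement layer might evaluate it. The thresholds are copied from the policy above; the &lt;code&gt;Assessment&lt;/code&gt; class, the &lt;code&gt;decide_mode&lt;/code&gt; function, and the mapping of Level 1 to &lt;code&gt;assisted_only&lt;/code&gt; are assumptions made for illustration, not part of HolmesGPT. The error budget gates, the action allowlist, and similarity scoring are left out.&lt;/p&gt;

```python
from dataclasses import dataclass

# Values copied from the escalation_policy.yaml shown above.
AUTONOMOUS_MIN_CONFIDENCE = 0.85
SUPERVISED_MIN_CONFIDENCE = 0.60
MAX_REPLICA_FRACTION = 0.20
MAX_SERVICE_COUNT = 1
MIN_OCCURRENCES_FOR_AUTONOMOUS = 10
ALWAYS_LEVEL_1_NAMESPACES = {"pci-zone", "hipaa-zone", "nerc-cip-zone"}


@dataclass
class Assessment:
    """Hypothetical input shape for this sketch; not a HolmesGPT type."""
    confidence: float
    replica_fraction: float
    service_count: int
    namespace: str
    historical_occurrences: int


def decide_mode(a: Assessment) -> str:
    """Return 'autonomous', 'supervised', or 'assisted_only'."""
    # Regulated namespaces never leave the most restrictive mode.
    if a.namespace in ALWAYS_LEVEL_1_NAMESPACES:
        return "assisted_only"
    # Blast radius and pattern history gate autonomy before confidence does.
    autonomous_allowed = (
        MAX_REPLICA_FRACTION >= a.replica_fraction
        and MAX_SERVICE_COUNT >= a.service_count
        and a.historical_occurrences >= MIN_OCCURRENCES_FOR_AUTONOMOUS
    )
    if autonomous_allowed and a.confidence >= AUTONOMOUS_MIN_CONFIDENCE:
        return "autonomous"
    if a.confidence >= SUPERVISED_MIN_CONFIDENCE:
        return "supervised"
    return "assisted_only"


print(decide_mode(Assessment(0.91, 0.15, 1, "production", 25)))
```

&lt;p&gt;The ordering is the point of the sketch: the regulatory boundary is checked first, so a high confidence score cannot buy autonomy in a regulated namespace.&lt;/p&gt;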






&lt;h2&gt;
  
  
  Model Routing for Escalation Quality
&lt;/h2&gt;

&lt;p&gt;The LiteLLM Proxy's model routing configuration is a first-class component of the escalation architecture. Routing to the right model at the right confidence tier is not a performance optimisation — it is a safety mechanism.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LiteLLM Proxy — Model Routing for Escalation Tiers&lt;/span&gt;
&lt;span class="c1"&gt;# Smaller local models for low blast radius / routine patterns&lt;/span&gt;
&lt;span class="c1"&gt;# Larger models with greater context window for high blast radius / novel patterns&lt;/span&gt;
&lt;span class="c1"&gt;# On-premises models for regulated asset investigations (data sovereignty)&lt;/span&gt;

&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Tier 1: Routine investigation — local Ollama model&lt;/span&gt;
  &lt;span class="c1"&gt;# Low latency, no data egress, adequate for well-characterised patterns&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt-routine&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/llama3.1:8b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://ollama.ai-ops.svc.cluster.local:11434&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;

  &lt;span class="c1"&gt;# Tier 2: Complex investigation — larger local model&lt;/span&gt;
  &lt;span class="c1"&gt;# Higher accuracy for multi-service correlation and novel patterns&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt-complex&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/llama3.1:70b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://ollama.ai-ops.svc.cluster.local:11434&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8192&lt;/span&gt;

  &lt;span class="c1"&gt;# Tier 3: High-stakes / novel pattern — GitHub Models&lt;/span&gt;
  &lt;span class="c1"&gt;# Largest context window for multi-service incident correlation&lt;/span&gt;
  &lt;span class="c1"&gt;# Data classification check required before routing: no PII, no regulated data&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt-highstakes&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://models.inference.ai.azure.com&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;os.environ/GITHUB_MODELS_PAT"&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16384&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom&lt;/span&gt;
  &lt;span class="na"&gt;routing_logic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;# Route by blast_radius_tier header set by HolmesGPT pre-routing assessment&lt;/span&gt;
    &lt;span class="s"&gt;if blast_radius_tier == "low" and pattern_novelty == "known":&lt;/span&gt;
        &lt;span class="s"&gt;return "holmesgpt-routine"&lt;/span&gt;
    &lt;span class="s"&gt;elif blast_radius_tier == "high" or pattern_novelty == "novel":&lt;/span&gt;
        &lt;span class="s"&gt;# Data classification gate before external model routing&lt;/span&gt;
        &lt;span class="s"&gt;if data_contains_regulated_fields:&lt;/span&gt;
            &lt;span class="s"&gt;return "holmesgpt-complex"  # Stay on-premises&lt;/span&gt;
        &lt;span class="s"&gt;return "holmesgpt-highstakes"&lt;/span&gt;
    &lt;span class="s"&gt;else:&lt;/span&gt;
        &lt;span class="s"&gt;return "holmesgpt-complex"&lt;/span&gt;

  &lt;span class="na"&gt;fallback_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt-complex&lt;/span&gt;    &lt;span class="c1"&gt;# Always fall back to on-premises&lt;/span&gt;
  &lt;span class="na"&gt;fallback_on_status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;429&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;500&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;503&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
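&lt;p&gt;The routing decision is small enough to test exhaustively. The sketch below restates it as a plain Python function and checks the property that matters for data sovereignty: no input combination that carries regulated fields reaches the external model. The function name and the way inputs arrive are assumptions; only the model names and the branch logic come from the configuration above.&lt;/p&gt;

```python
from itertools import product

# Model names taken from the model_list above; both of these run on-premises.
ON_PREM_MODELS = {"holmesgpt-routine", "holmesgpt-complex"}


def select_model(blast_radius_tier: str, pattern_novelty: str,
                 data_contains_regulated_fields: bool) -> str:
    """Mirror of the routing pseudocode; the name select_model is hypothetical."""
    if blast_radius_tier == "low" and pattern_novelty == "known":
        return "holmesgpt-routine"
    if blast_radius_tier == "high" or pattern_novelty == "novel":
        # Data classification gate before external model routing.
        if data_contains_regulated_fields:
            return "holmesgpt-complex"
        return "holmesgpt-highstakes"
    return "holmesgpt-complex"


def regulated_data_stays_on_prem() -> bool:
    """Check every tier/novelty combination with regulated data present."""
    tiers = ["low", "medium", "high"]
    novelties = ["known", "novel"]
    return all(
        select_model(tier, novelty, True) in ON_PREM_MODELS
        for tier, novelty in product(tiers, novelties)
    )


print(regulated_data_stays_on_prem())
```

&lt;p&gt;Running a check like this in CI whenever the routing rules change would catch a reordered branch that lets regulated data leave the cluster.&lt;/p&gt;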






&lt;h2&gt;
  
  
  The Recommendation Quality Feedback Loop
&lt;/h2&gt;

&lt;p&gt;The operational risk of AI-assisted recommendations is not static. It changes as the system evolves and as production behaviour drifts away from the data the model was trained on. A recommendation quality feedback loop makes this drift visible before it produces a damaging autonomous action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus Recording Rules — AI Recommendation Quality Tracking&lt;/span&gt;
&lt;span class="c1"&gt;# Measures whether HolmesGPT recommendations are operationally valuable&lt;/span&gt;
&lt;span class="c1"&gt;# High override rate or low action rate = recommendation quality degrading&lt;/span&gt;

&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt.recommendation_quality&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

      &lt;span class="c1"&gt;# Recommendation acceptance rate: fraction of recommendations&lt;/span&gt;
      &lt;span class="c1"&gt;# that operators acted on (approved or executed autonomously)&lt;/span&gt;
      &lt;span class="c1"&gt;# versus rejected or ignored&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt:recommendation_acceptance_rate:rate7d&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(holmesgpt_recommendations_acted_on_total[7d]))&lt;/span&gt;
          &lt;span class="s"&gt;/&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(holmesgpt_recommendations_total[7d]))&lt;/span&gt;

      &lt;span class="c1"&gt;# Operator override rate: fraction of autonomous actions that&lt;/span&gt;
      &lt;span class="c1"&gt;# were manually reversed by an operator after execution&lt;/span&gt;
      &lt;span class="c1"&gt;# High rate = autonomous confidence thresholds are too permissive&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt:autonomous_override_rate:rate7d&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(holmesgpt_autonomous_actions_reversed_total[7d]))&lt;/span&gt;
          &lt;span class="s"&gt;/&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(holmesgpt_autonomous_actions_total[7d]))&lt;/span&gt;

      &lt;span class="c1"&gt;# False positive rate: recommendations made but outcome was&lt;/span&gt;
      &lt;span class="c1"&gt;# NOT the recommended action resolving the incident&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;holmesgpt:false_positive_rate:rate7d&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(holmesgpt_recommendations_outcome_mismatch_total[7d]))&lt;/span&gt;
          &lt;span class="s"&gt;/&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(holmesgpt_recommendations_acted_on_total[7d]))&lt;/span&gt;

      &lt;span class="c1"&gt;# Alert: recommendation quality degrading&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HolmesGPT_RecommendationQualityDegrading&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;holmesgpt:autonomous_override_rate:rate7d &amp;gt; 0.15&lt;/span&gt;
          &lt;span class="s"&gt;OR&lt;/span&gt;
          &lt;span class="s"&gt;holmesgpt:false_positive_rate:rate7d &amp;gt; 0.20&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1d&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket&lt;/span&gt;
          &lt;span class="na"&gt;domain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai_ops_quality&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="s"&gt;HolmesGPT recommendation quality below threshold.&lt;/span&gt;
            &lt;span class="s"&gt;Override rate: {{ with query "holmesgpt:autonomous_override_rate:rate7d" }}&lt;/span&gt;
            &lt;span class="s"&gt;{{ . | first | value | humanizePercentage }}{{ end }}.&lt;/span&gt;
            &lt;span class="s"&gt;Action: review recent overrides, update prompt context,&lt;/span&gt;
            &lt;span class="s"&gt;consider reducing autonomous confidence threshold.&lt;/span&gt;
          &lt;span class="na"&gt;runbook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://wiki.internal/sre/runbooks/holmesgpt-quality-review"&lt;/span&gt;

      &lt;span class="c1"&gt;# Alert: recommendation volume causing alert fatigue risk&lt;/span&gt;
      &lt;span class="c1"&gt;# More than 3 recommendations per incident = cognitive overload signal&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HolmesGPT_RecommendationVolumeHigh&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(holmesgpt_recommendations_total[1h]))&lt;/span&gt;
          &lt;span class="s"&gt;/&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(incidents_opened_total[1h])) &amp;gt; 3&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="s"&gt;HolmesGPT generating &amp;gt; 3 recommendations per incident on average.&lt;/span&gt;
            &lt;span class="s"&gt;Risk: alert fatigue causing operators to ignore recommendations.&lt;/span&gt;
            &lt;span class="s"&gt;Action: tighten confidence floor or reduce recommendation scope.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
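&lt;p&gt;The alert condition is simple arithmetic over counter increases, which makes it easy to sanity-check offline. The Python sketch below reproduces the two ratios and the 0.15 and 0.20 thresholds from the rules above. The function names are hypothetical, and returning NaN for an empty window is a choice made for this sketch so that a week with no autonomous actions does not trip the check.&lt;/p&gt;

```python
import math

OVERRIDE_RATE_LIMIT = 0.15        # from HolmesGPT_RecommendationQualityDegrading
FALSE_POSITIVE_RATE_LIMIT = 0.20  # from the same alert expression


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    if denominator == 0:
        return math.nan
    return numerator / denominator


def quality_degrading(reversed_actions: float, autonomous_actions: float,
                      outcome_mismatches: float, acted_on: float) -> bool:
    """Evaluate the alert condition from 7-day counter increases."""
    override_rate = safe_ratio(reversed_actions, autonomous_actions)
    false_positive_rate = safe_ratio(outcome_mismatches, acted_on)
    # Any comparison against NaN is False, so empty windows never fire.
    return (override_rate > OVERRIDE_RATE_LIMIT
            or false_positive_rate > FALSE_POSITIVE_RATE_LIMIT)


# 2 of 10 autonomous actions reversed gives an override rate of 0.20.
print(quality_degrading(2, 10, 0, 20))
```

&lt;p&gt;With small weekly volumes a single reversal moves the ratio a long way (one reversal in five actions is already 0.20), which is one reason the alert waits for the condition to hold for a full day.&lt;/p&gt;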






&lt;h2&gt;
  
  
  The Accountability Chain Principle and NIST AI RMF Alignment
&lt;/h2&gt;

&lt;p&gt;The accountability chain principle — that every AI-assisted action must trace back to a human decision, either a direct approval or a policy that a human wrote and approved — is the operational implementation of the NIST AI Risk Management Framework's GOVERN function.&lt;/p&gt;

&lt;p&gt;The NIST AI RMF establishes four core functions for AI risk management: GOVERN (policies, accountability), MAP (risk identification), MEASURE (risk quantification), and MANAGE (risk response). Each function maps directly to components of the escalation policy architecture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NIST AI RMF MAPPING: AI-ASSISTED SRE OPERATIONS
────────────────────────────────────────────────────────────────────────────

GOVERN — Accountability and Policy
  Who owns the AI system's outputs?
    → SRE Lead owns escalation policy; VP Engineering co-approves
  Who approves autonomous action boundaries?
    → Policy document with named approvers and review cadence
  How are accountability chains maintained?
    → Splunk audit trail: every recommendation, decision, and outcome
  SRE implementation: escalation policy document + approval workflow

MAP — Risk Identification
  What failure modes does the AI system face?
    → Confidence decay: model accuracy degrades as system evolves
    → Distribution shift: production patterns diverge from training data
    → Novel pattern extrapolation: confident recommendation on unfamiliar input
    → Blast radius miscalculation: action scope larger than assessed
  SRE implementation: four escalation triggers + novelty detection

MEASURE — Risk Quantification
  How do you measure AI recommendation quality over time?
    → Acceptance rate: fraction of recommendations acted on
    → Override rate: fraction of autonomous actions manually reversed
    → False positive rate: recommendations where predicted outcome was wrong
    → Confidence calibration: does 85% confidence actually mean 85% accuracy?
  SRE implementation: Prometheus quality recording rules + 7-day rolling metrics

MANAGE — Risk Response
  What happens when AI recommendation quality degrades?
    → Automatic downgrade of autonomous confidence threshold
    → Prompt context refresh from recent incident postmortems
    → Temporary suspension of Level 4 autonomy pending review
  SRE implementation: quality alert → runbook → policy review cadence
────────────────────────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Splunk Audit Trail: The Irreplaceable Governance Layer
&lt;/h2&gt;

&lt;p&gt;In regulated environments, the audit trail for AI-assisted actions is not optional. It is the documentary evidence that demonstrates human accountability over automated decisions — the record that answers the auditor's question: "Who authorised this change to your production system?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Splunk HEC Forwarder — HolmesGPT Decision Audit Trail&lt;/span&gt;
&lt;span class="c1"&gt;# Every recommendation, escalation decision, and outcome → Splunk&lt;/span&gt;
&lt;span class="c1"&gt;# This record is the accountability chain in documentary form&lt;/span&gt;

&lt;span class="c1"&gt;# Splunk event structure (sourcetype: sre:holmesgpt:decisions):&lt;/span&gt;
&lt;span class="c1"&gt;# {&lt;/span&gt;
&lt;span class="c1"&gt;#   "timestamp": "2025-04-15T14:23:07Z",&lt;/span&gt;
&lt;span class="c1"&gt;#   "incident_id": "INC-20250415-0047",&lt;/span&gt;
&lt;span class="c1"&gt;#   "alert_name": "KubePodOOMKilled",&lt;/span&gt;
&lt;span class="c1"&gt;#   "service": "payments-api",&lt;/span&gt;
&lt;span class="c1"&gt;#   "namespace": "production",&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;#   "investigation": {&lt;/span&gt;
&lt;span class="c1"&gt;#     "model_used": "holmesgpt-routine",&lt;/span&gt;
&lt;span class="c1"&gt;#     "model_backend": "ollama/llama3.1:8b",&lt;/span&gt;
&lt;span class="c1"&gt;#     "confidence_score": 0.91,&lt;/span&gt;
&lt;span class="c1"&gt;#     "diagnosis": "Memory limit (2Gi) exceeded by 847MB under high load...",&lt;/span&gt;
&lt;span class="c1"&gt;#     "recommended_action": "rolling_restart_stateless",&lt;/span&gt;
&lt;span class="c1"&gt;#     "blast_radius_assessment": {&lt;/span&gt;
&lt;span class="c1"&gt;#       "services_affected": 1,&lt;/span&gt;
&lt;span class="c1"&gt;#       "replica_fraction": 0.15,&lt;/span&gt;
&lt;span class="c1"&gt;#       "reversible": true,&lt;/span&gt;
&lt;span class="c1"&gt;#       "regulated_asset": false&lt;/span&gt;
&lt;span class="c1"&gt;#     }&lt;/span&gt;
&lt;span class="c1"&gt;#   },&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;#   "escalation_decision": {&lt;/span&gt;
&lt;span class="c1"&gt;#     "autonomy_level": 4,&lt;/span&gt;
&lt;span class="c1"&gt;#     "policy_version": "v1.3",&lt;/span&gt;
&lt;span class="c1"&gt;#     "triggers_evaluated": ["confidence", "blast_radius", "novelty", "regulatory"],&lt;/span&gt;
&lt;span class="c1"&gt;#     "triggers_fired": [],&lt;/span&gt;
&lt;span class="c1"&gt;#     "decision": "AUTONOMOUS_EXECUTE",&lt;/span&gt;
&lt;span class="c1"&gt;#     "policy_authority": "holmesgpt-escalation-policy v1.3 (approved: sre-lead)"&lt;/span&gt;
&lt;span class="c1"&gt;#   },&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;#   "execution": {&lt;/span&gt;
&lt;span class="c1"&gt;#     "action_taken": "rolling_restart_stateless",&lt;/span&gt;
&lt;span class="c1"&gt;#     "execution_start": "2025-04-15T14:23:09Z",&lt;/span&gt;
&lt;span class="c1"&gt;#     "verification_result": "HEALTHY",&lt;/span&gt;
&lt;span class="c1"&gt;#     "mttr_seconds": 67,&lt;/span&gt;
&lt;span class="c1"&gt;#     "operator_override": false&lt;/span&gt;
&lt;span class="c1"&gt;#   },&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;#   "quality_signals": {&lt;/span&gt;
&lt;span class="c1"&gt;#     "prediction_matched_outcome": true,&lt;/span&gt;
&lt;span class="c1"&gt;#     "error_budget_consumed_pct": 0.002,&lt;/span&gt;
&lt;span class="c1"&gt;#     "operator_satisfaction": null    # Populated by post-incident feedback&lt;/span&gt;
&lt;span class="c1"&gt;#   }&lt;/span&gt;
&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;policy_authority&lt;/code&gt; field in the escalation decision block is the accountability chain closure. It names the specific policy document version and its human approvers. When an auditor asks who authorised the autonomous action, the answer is not "the AI decided" — it is "the SRE Lead and VP Engineering approved escalation policy v1.3 on 2025-03-15, and this action fell within the boundaries of Section 1 of that policy."&lt;/p&gt;
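&lt;p&gt;One way to keep that field trustworthy is to derive it from the policy's own annotations rather than hard-coding it in the forwarder. The sketch below is a minimal Python illustration; the annotation keys match the policy ConfigMap, while the function name and the exact output format are assumptions.&lt;/p&gt;

```python
def policy_authority(policy_name: str, annotations: dict) -> str:
    """Build a policy_authority string from ConfigMap annotations.

    The function and its output format are hypothetical; the annotation
    keys are the sre.internal/* keys used on the escalation policy.
    """
    version = annotations["sre.internal/policy-version"]
    approvers = [
        name.strip()
        for name in annotations["sre.internal/approved-by"].split(",")
    ]
    approver_list = ", ".join(approvers)
    return f"{policy_name} {version} (approved: {approver_list})"


annotations = {
    "sre.internal/policy-version": "v1.3",
    "sre.internal/approved-by": "sre-lead,vp-engineering",
    "sre.internal/approved-date": "2025-03-15",
}
print(policy_authority("holmesgpt-escalation-policy", annotations))
```

&lt;p&gt;Because the string is built at decision time, an audit event always names the policy version and approvers that were actually in force, even after the policy has been revised.&lt;/p&gt;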




&lt;h2&gt;
  
  
  The Confidence Calibration Problem
&lt;/h2&gt;

&lt;p&gt;A confidence score of 0.85 from a language model does not intrinsically mean that the recommendation is correct 85% of the time. Language models are notoriously poorly calibrated — they express high confidence in incorrect outputs and sometimes express low confidence in correct ones. The confidence threshold in the escalation policy must be calibrated against the AI system's actual historical accuracy, not against the model's self-reported certainty.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Splunk SPL: Confidence Calibration Assessment&lt;/span&gt;
&lt;span class="c1"&gt;-- Compares model-reported confidence bands against actual outcome accuracy&lt;/span&gt;
&lt;span class="c1"&gt;-- Run monthly; output informs confidence threshold calibration in policy&lt;/span&gt;

&lt;span class="k"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sre_holmesgpt&lt;/span&gt; &lt;span class="n"&gt;sourcetype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;"sre:holmesgpt:decisions"&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;eval&lt;/span&gt; &lt;span class="n"&gt;confidence_band&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"90-100%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"85-89%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"80-84%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"70-79%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"60-69%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;                   &lt;span class="nv"&gt;"&amp;lt;60%"&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;                                          &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_recommendations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_matched_outcome&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;correct_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_matched_outcome&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;empirical_accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operator_override&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                         &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;operator_overrides&lt;/span&gt;
    &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;confidence_band&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_used&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;eval&lt;/span&gt;
    &lt;span class="n"&gt;calibration_delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;empirical_accuracy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confidence_band&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;calibration_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calibration_delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"CALIBRATED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"MISCALIBRATED"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
    &lt;span class="n"&gt;confidence_band&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_recommendations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;empirical_accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operator_overrides&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="n"&gt;confidence_band&lt;/span&gt;

&lt;span class="c1"&gt;-- If empirical_accuracy at "85-89%" band is actually 0.71:&lt;/span&gt;
&lt;span class="c1"&gt;-- The 0.85 autonomous threshold is accepting actions that are only&lt;/span&gt;
&lt;span class="c1"&gt;-- correct 71% of the time. Raise threshold or re-evaluate model.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
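&lt;p&gt;The same calibration check can be prototyped offline against an export of decision events. The Python sketch below follows the query's banding and its 0.10 tolerance; the function names and the input shape (pairs of confidence score and outcome match) are assumptions. As in the query, each band is compared against its lower bound, and the lowest band uses 0.0 as its reference, which is an arbitrary choice for this sketch.&lt;/p&gt;

```python
from collections import defaultdict

# Band lower bounds mirror the case() expression in the query above.
BANDS = [(0.90, "90-100%"), (0.85, "85-89%"), (0.80, "80-84%"),
         (0.70, "70-79%"), (0.60, "60-69%")]
TOLERANCE = 0.10


def band_for(score: float) -> tuple[float, str]:
    """Return (lower_bound, label) for a confidence score."""
    for lower, label in BANDS:
        if score >= lower:
            return lower, label
    return 0.0, "below 60%"


def calibration_report(decisions: list[tuple[float, bool]]) -> dict[str, dict]:
    """Group (confidence, outcome_matched) pairs by band and compare
    empirical accuracy with the band's lower bound."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    lowers = {}
    for score, matched in decisions:
        lower, label = band_for(score)
        lowers[label] = lower
        totals[label] += 1
        correct[label] += int(matched)
    report = {}
    for label, total in totals.items():
        accuracy = correct[label] / total
        delta = accuracy - lowers[label]
        status = "CALIBRATED" if TOLERANCE > abs(delta) else "MISCALIBRATED"
        report[label] = {
            "total": total,
            "empirical_accuracy": accuracy,
            "calibration_delta": delta,
            "status": status,
        }
    return report


sample = [(0.86, True), (0.87, True), (0.88, False)]
print(calibration_report(sample)["85-89%"]["status"])
```

&lt;p&gt;Three events are far too few to draw conclusions from; in practice each band needs enough samples that the empirical accuracy is stable before it is used to move a threshold.&lt;/p&gt;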






&lt;h2&gt;
  
  
  Common Antipatterns
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Confidence Theatre antipattern&lt;/strong&gt; → Using model-reported confidence scores as the primary autonomous execution gate without calibration against empirical outcome accuracy. A model that reports 0.92 confidence but is empirically correct 68% of the time is a dangerous basis for autonomous action. Calibration against historical outcomes must precede the deployment of any confidence-based gate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Policy-as-Default antipattern&lt;/strong&gt; → Deploying the AI system with permissive defaults and planning to tighten the escalation policy "after we see how it performs in production." The escalation policy must be the first artefact produced, not a retroactive constraint on a system that is already taking autonomous actions. Permissive defaults in AI operations systems are not starting points; they are incident preconditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Accountability Diffusion antipattern&lt;/strong&gt; → Designing the system so that no single person is clearly accountable for an autonomous AI action. "The AI did it" is not an accountability chain. "The escalation policy approved by [names] on [date] authorised this class of action" is. In regulated environments, the inability to name a responsible human for a production change is itself a compliance finding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Alert Fatigue Transfer antipattern&lt;/strong&gt; → Moving from a system that generates too many monitoring alerts to a system that generates too many AI recommendations. If HolmesGPT surfaces seven recommendations per incident, operators will start ignoring them at the same rate they ignore high-volume monitoring alerts. Recommendation volume should be governed by the same principles as alert volume: every recommendation must be actionable, and when in doubt a marginal recommendation should be suppressed rather than surfaced.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Permanent Level 4 antipattern&lt;/strong&gt; → Classifying an autonomous action as Level 4 and never re-qualifying it. The re-qualification cadence is the mechanism that prevents a well-calibrated autonomous action from silently becoming a dangerous one as the system evolves. Every Level 4 action must carry a &lt;code&gt;sre.internal/sot-next-review&lt;/code&gt; equivalent annotation and a Kyverno policy that generates a ticket when the date passes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
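
&lt;p&gt;As a rough illustration of the calibration check that the Confidence Theatre item calls for, the Python sketch below buckets historical decisions by reported confidence and compares each bucket with its empirical accuracy. The function names and the (confidence, was_correct) input shape are assumptions made for this example; they are not part of HolmesGPT or Splunk, and the outcome labels would have to come from your own incident reviews.&lt;/p&gt;

```python
from collections import defaultdict


def calibration_table(decisions, bucket_width=0.1):
    """Group (confidence, was_correct) pairs into confidence buckets.

    Returns {bucket_floor: (mean_reported_confidence, empirical_accuracy, count)}.
    """
    top_index = int(round(1 / bucket_width)) - 1
    buckets = defaultdict(list)
    for confidence, was_correct in decisions:
        # Rounding guards against float noise such as 0.3 / 0.1 == 2.9999...
        index = min(int(round(confidence / bucket_width, 9)), top_index)
        buckets[index].append((confidence, bool(was_correct)))

    table = {}
    for index, rows in sorted(buckets.items()):
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table[round(index * bucket_width, 2)] = (mean_conf, accuracy, len(rows))
    return table


def calibration_gap(table):
    """Largest amount by which reported confidence overstates accuracy."""
    return max((conf - acc for conf, acc, _ in table.values()), default=0.0)
```

&lt;p&gt;Feeding it 25 decisions reported at 0.92 confidence, of which 17 were correct, gives an empirical accuracy of 0.68 and a gap of about 0.24, which is the situation described in the list above. A gate could refuse autonomous execution while that gap stays above a tolerance you choose.&lt;/p&gt;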




&lt;h2&gt;
  
  
  Maturity Progression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;────────────────────────────────────────────────────────────────────────────
STAGE        AI-OPS ESCALATION STATE            NORTH STAR SIGNAL
────────────────────────────────────────────────────────────────────────────
Reactive     No AI-assisted operations.         All investigation is
             Operators work from raw            manual. MTTR limited
             telemetry only.                    by human availability.

Defined      HolmesGPT deployed at              AI operating at Level
             Level 1 only. Escalation           1–2 only. Context
             policy drafted but not             surfacing measurably
             yet governing autonomous           reduces investigation
             action.                            time.

Measured     Escalation policy governs          Recommendation quality
             Level 3–4 boundaries.              metrics tracked. Confidence
             Audit trail in Splunk.             calibration assessed
             Quality metrics active.            monthly. Override rate
                                                below 15%.

Optimised    Confidence calibration             Level 4 actions cover
             cycle running quarterly.           top-5 toil remediations.
             Model routing by blast             MTTR for covered patterns
             radius operational.                &amp;lt; 5 minutes (automated).
             NIST AI RMF aligned.               Audit trail satisfies
                                                regulatory review.

Generative   Escalation policy published        Policy cited in industry
             as reference architecture.         guidance. Recommendation
             Feedback loop feeds                quality above 85%.
             prompt engineering cycle.          AI-ops layer itself
             AI-ops treated as a                has SLO and error budget.
             production service.
────────────────────────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Five Action Items for This Week
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Draft your escalation policy document before configuring any autonomous action in HolmesGPT.&lt;/strong&gt; Start with the accountability chain section: who owns the policy, who approves autonomous action boundaries, and what the change record looks like. A policy document that exists on paper but has not been approved by SRE leadership and VP Engineering is not a governance artefact — it is a draft. The approval is the governance act.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run the Splunk confidence calibration query against your last 90 days of HolmesGPT decisions.&lt;/strong&gt; If you do not yet have 90 days of data, start collecting it now at Level 1 only. Calibration data must precede autonomous execution boundaries. The calibration query is the empirical basis for your confidence thresholds — thresholds chosen without it are guesses with operational consequences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Map every existing automated remediation to an autonomy level and a blast radius assessment.&lt;/strong&gt; For each automation in your Class 1 (Reactive Remediation) category from the automation taxonomy post, assess: what is its blast radius under worst-case conditions, and what confidence mechanism governs when it executes? Automations with no explicit blast radius boundary and no confidence mechanism are operating at implicit Level 4 without a policy. Make the policy explicit before the next incident.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configure the recommendation quality Prometheus rules and set a 30-day baseline.&lt;/strong&gt; Even if you are operating at Level 1 only, begin measuring acceptance rate and false positive rate now. The first meaningful governance conversation about elevating to Level 3 or Level 4 should be anchored in empirical quality data, not in enthusiasm about the capability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add the four escalation triggers as literal fields to your HolmesGPT Splunk audit events.&lt;/strong&gt; Every decision event should record: &lt;code&gt;confidence_trigger_fired: true/false&lt;/code&gt;, &lt;code&gt;blast_radius_trigger_fired: true/false&lt;/code&gt;, &lt;code&gt;novelty_trigger_fired: true/false&lt;/code&gt;, &lt;code&gt;regulatory_trigger_fired: true/false&lt;/code&gt;. Over time, this data reveals which triggers are governing your escalation decisions most frequently — and which failure modes your autonomous boundary is most exposed to.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
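
&lt;p&gt;For the fifth action item, a minimal Python sketch of what a decision event with the four trigger fields might look like before it is shipped to the audit index. Only the four &lt;code&gt;*_trigger_fired&lt;/code&gt; field names come from the list above; &lt;code&gt;decision_id&lt;/code&gt;, &lt;code&gt;autonomy_level&lt;/code&gt;, and &lt;code&gt;escalated_to_human&lt;/code&gt; are hypothetical, and the rule that any fired trigger escalates is an assumption, not a HolmesGPT schema.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EscalationTriggers:
    """The four trigger flags named in action item 5."""
    confidence_trigger_fired: bool
    blast_radius_trigger_fired: bool
    novelty_trigger_fired: bool
    regulatory_trigger_fired: bool

    def any_fired(self):
        return any(asdict(self).values())


def build_audit_event(decision_id, autonomy_level, triggers):
    """Return one JSON line for the audit trail.

    decision_id and autonomy_level are illustrative fields only.
    """
    event = {"decision_id": decision_id, "autonomy_level": autonomy_level}
    event.update(asdict(triggers))
    # Assumption: any fired trigger hands the decision to a human.
    event["escalated_to_human"] = triggers.any_fired()
    return json.dumps(event, sort_keys=True)
```

&lt;p&gt;Recording the flags as literal booleans on every event, including the ones where nothing fired, is what makes it possible to count trigger frequency later with a plain aggregation.&lt;/p&gt;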




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The risk in AI-assisted SRE is not that the automation will fail to act. The risk is that it will act confidently, at scale, on a pattern it has only partially understood — and that the human who approved the policy that authorised the action will not be reachable, will not remember what the policy said, or will not realise the policy applied to this situation. The escalation policy is not a constraint on AI capability. It is the engineering discipline that makes AI capability safe to deploy in systems where the cost of being confidently wrong is borne by users, not by the model."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;The escalation policy governs how AI recommendations become actions. The harder engineering problem is the quality of the recommendations themselves — specifically, how to evaluate LLM reliability for incident diagnosis with the same rigour that SRE applies to any other production dependency. The next post examines what it means to apply an SLO framework to an AI system: defining SLIs for recommendation accuracy, precision, and recall; setting error budgets for the AI-ops layer; and designing the automated quality gates that prevent a degrading LLM backend from silently undermining the operational decisions that depend on it.&lt;/p&gt;




</description>
      <category>sre</category>
      <category>devops</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Why vLLM autoscaling on Kubernetes breaks (and what to use instead)</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Mon, 15 Jun 2026 15:10:34 +0000</pubDate>
      <link>https://dev.to/soniarotglam/why-vllm-autoscaling-on-kubernetes-breaks-and-what-to-use-instead-1231</link>
      <guid>https://dev.to/soniarotglam/why-vllm-autoscaling-on-kubernetes-breaks-and-what-to-use-instead-1231</guid>
      <description>&lt;p&gt;If you deploy vLLM on Kubernetes and reach for the standard HPA-on-CPU autoscaling, you will ship something that looks fine in testing and falls apart under real traffic.&lt;br&gt;
Here is why, and what to do instead.&lt;/p&gt;
&lt;h2&gt;
  
  
  The problem: Kubernetes can't see your inference load
&lt;/h2&gt;

&lt;p&gt;HPA scales on CPU and memory by default. Both are useless signals for LLM inference.&lt;br&gt;
CPU stays low because the GPU does the work. A vLLM pod serving zero requests and one serving 100 show nearly identical CPU.&lt;br&gt;
GPU memory stays constant because vLLM pre-allocates it for the KV cache at startup. It never moves, so it never triggers a scaling decision.&lt;br&gt;
So under heavy load, your vLLM deployment looks idle to Kubernetes while requests pile up inside the engine and latency climbs. The scheduler has no idea anything is wrong.&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix: autoscale on the metrics that reflect real load
&lt;/h2&gt;

&lt;p&gt;The signals that matter live inside vLLM and are exported on its Prometheus endpoint:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vllm:num_requests_waiting&lt;/code&gt; : how many requests are queued.&lt;br&gt;
&lt;code&gt;vllm:gpu_cache_usage_perc&lt;/code&gt; : how full the KV cache is.&lt;/p&gt;

&lt;p&gt;Wire these into KEDA, not HPA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keda.sh/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ScaledObject&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-scaler&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-inference&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-inference&lt;/span&gt;
  &lt;span class="na"&gt;minReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;cooldownPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
  &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serverAddress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus:9090&lt;/span&gt;
      &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm_queue_depth&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(vllm:num_requests_waiting) /&lt;/span&gt;
        &lt;span class="s"&gt;count(kube_deployment_status_replicas_ready{deployment="vllm-inference"})&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serverAddress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus:9090&lt;/span&gt;
      &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm_kv_cache&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.8"&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;avg(vllm:gpu_cache_usage_perc)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cooldownPeriod: 300&lt;/code&gt; is not optional. vLLM cold start takes minutes. A short cooldown thrashes, scaling down then immediately back up, paying the cold-start cost on repeat.&lt;/p&gt;
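
&lt;p&gt;To make the queue-depth trigger concrete, here is a simplified Python sketch of the replica arithmetic. It assumes the trigger is evaluated as an average-value target, so the summed metric is divided by the threshold and rounded up, then clamped to the configured bounds. It deliberately ignores the autoscaler's tolerance band, stabilization windows, and the cooldown itself, and the function name is made up for this example.&lt;/p&gt;

```python
import math


def desired_replicas(metric_total, threshold, min_replicas=1, max_replicas=8):
    """Replicas needed so that metric_total / replicas stays at or below threshold."""
    wanted = math.ceil(metric_total / threshold)
    # Clamp to the minReplicaCount / maxReplicaCount bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

&lt;p&gt;With a threshold of 10, a total of 35 waiting requests asks for 4 replicas, an empty queue falls back to the minimum of 1, and a backlog of 500 is capped at 8. When several triggers are defined, the largest proposal wins.&lt;/p&gt;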

&lt;h2&gt;
  
  
  Three more things that bite in production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cold start.&lt;/strong&gt; Pre-cache model weights on a ReadOnlyMany PVC. Downloading a 7B model from HuggingFace on every scale-up adds 2-5 minutes. Mounting from a pre-populated PVC cuts that to 30-60 seconds. Set a &lt;code&gt;startupProbe&lt;/code&gt; with a high &lt;code&gt;failureThreshold&lt;/code&gt; so Kubernetes doesn't kill the pod mid-load.&lt;br&gt;
&lt;strong&gt;KV cache OOM.&lt;/strong&gt; The most common vLLM crash. Size it with &lt;code&gt;params_B × 2&lt;/code&gt; for FP16 weights plus 25% for the cache, and set &lt;code&gt;--gpu-memory-utilization 0.85&lt;/code&gt; for headroom. 0.95 will OOM under concurrent load.&lt;br&gt;
&lt;strong&gt;Preemption.&lt;/strong&gt; When the KV cache fills, vLLM silently preempts older requests to make room. P99 latency can spike 8x. Alert on &lt;code&gt;rate(vllm:num_preemptions_total[1m]) &amp;gt; 0.05&lt;/code&gt; for 30s. It's the earliest warning you'll get.&lt;/p&gt;
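
&lt;p&gt;The sizing rule of thumb above can be written out as a small Python sketch. It reads "plus 25% for the cache" as 25% on top of the FP16 weight size, which is an interpretation of the rule rather than something vLLM computes for you, and the function names are hypothetical.&lt;/p&gt;

```python
def estimate_gpu_memory_gb(params_billion, cache_overhead=0.25):
    """Rule of thumb from the text: 2 bytes per parameter (FP16) plus 25% for the cache."""
    weights_gb = params_billion * 2
    return weights_gb * (1 + cache_overhead)


def headroom_gb(params_billion, gpu_total_gb, gpu_memory_utilization=0.85):
    """Memory left inside the vLLM budget; a negative value means the model will not fit."""
    budget_gb = gpu_total_gb * gpu_memory_utilization
    return budget_gb - estimate_gpu_memory_gb(params_billion)
```

&lt;p&gt;For a 7B model the estimate is 14 GB of weights plus 3.5 GB, or 17.5 GB. On a 24 GB GPU at 0.85 utilization the budget is 20.4 GB, which leaves roughly 2.9 GB of headroom; a 13B model on the same card comes out negative.&lt;/p&gt;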

&lt;h2&gt;
  
  
  The metric that matters most
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Watch preemption rate.&lt;/strong&gt; Standard Kubernetes monitoring (CPU, memory, restarts) tells you nothing about inference health. Preemption rate precedes the latency spikes your users actually feel. If you alert on one vLLM metric, alert on that one.&lt;/p&gt;

&lt;p&gt;I wrote the full version with the GPU memory sizing rules, Spot vs On-Demand node strategy, cold-start mitigation including NVIDIA Dynamo Snapshot, the four metrics to monitor, and a production checklist over on our blog: &lt;a href="https://thegoodshell.com/vllm-kubernetes/" rel="noopener noreferrer"&gt;https://thegoodshell.com/vllm-kubernetes/&lt;/a&gt;&lt;br&gt;
What signals are you autoscaling LLM inference on? Curious if anyone has found something better than queue depth + KV cache.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>K9s — The Terminal UI That Made Me Love K8s Again</title>
      <dc:creator>Sohana Akbar</dc:creator>
      <pubDate>Mon, 15 Jun 2026 14:58:20 +0000</pubDate>
      <link>https://dev.to/sohanaakbar7/k9s-the-terminal-ui-that-made-me-love-k8s-again-2kgp</link>
      <guid>https://dev.to/sohanaakbar7/k9s-the-terminal-ui-that-made-me-love-k8s-again-2kgp</guid>
      <description>&lt;p&gt;Let me be honest with you.&lt;/p&gt;

&lt;p&gt;For the first two years of my Kubernetes journey, I was that person. You know the one. The person who says, “Kubernetes is amazing, but the CLI friction kills my flow.”&lt;/p&gt;

&lt;p&gt;I had aliases for my aliases. I used kubectx and kubens religiously. I even built a messy bash script to tail logs from five pods at once. And still, debugging a rollout felt like doing surgery with oven mitts.&lt;/p&gt;

&lt;p&gt;Then I found K9s.&lt;/p&gt;

&lt;p&gt;And I swear to you: within ten minutes, I actually enjoyed managing my cluster again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Too Much Typing, Not Enough Seeing
&lt;/h2&gt;

&lt;p&gt;Here’s the dirty secret of kubectl: it’s a single-purpose knife in a multi-course meal. Want to check pod status? &lt;code&gt;kubectl get pods&lt;/code&gt;. Want to see logs? &lt;code&gt;kubectl logs -f pod-name&lt;/code&gt;. Want to describe a deployment? New command. Scale it? Another command. Jump into a shell? You get the idea.&lt;/p&gt;

&lt;p&gt;The cognitive load isn't the cluster—it’s the context switching between commands and namespaces.&lt;/p&gt;

&lt;p&gt;K9s solves that by turning your terminal into a live, interactive cockpit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is K9s, Really?
&lt;/h2&gt;

&lt;p&gt;K9s is a Terminal UI (TUI) that watches your cluster in real time. It’s not a dashboard you open in a browser. It lives inside your terminal, respects your SSH config, uses your existing kubeconfig, and weighs essentially nothing.&lt;/p&gt;

&lt;p&gt;But the magic isn't the tech specs. The magic is the ergonomics.&lt;/p&gt;

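&lt;p&gt;When you launch k9s, you don’t type &lt;code&gt;--namespace&lt;/code&gt;. You don’t remember flags. You press &lt;code&gt;:&lt;/code&gt; (colon) and type &lt;code&gt;pods&lt;/code&gt;, &lt;code&gt;deploy&lt;/code&gt;, or &lt;code&gt;svc&lt;/code&gt; to jump to that resource view, or &lt;code&gt;ctx&lt;/code&gt; to switch contexts instantly. The &lt;code&gt;/&lt;/code&gt; key is reserved for filtering whatever view you are in.&lt;/p&gt;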

&lt;p&gt;Let me show you how my workflow changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before vs. After K9s
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before (pure kubectl)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods -n myapp
# oh, pod is crashing
kubectl logs myapp-pod-7d8f9-abc -n myapp --tail=50
# hmm, need to see events
kubectl describe pod myapp-pod-7d8f9-abc -n myapp | grep -A 5 Events
# ok scale down
kubectl scale deploy/myapp -n myapp --replicas=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That’s 4 commands, 2 copy-pastes, and one grep.&lt;/p&gt;

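&lt;p&gt;&lt;strong&gt;After (K9s)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Run &lt;code&gt;k9s&lt;/code&gt; and press enter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Press &lt;code&gt;:&lt;/code&gt;, type &lt;code&gt;ns&lt;/code&gt;, and select myapp to switch namespace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Arrow keys to the crashing pod → press &lt;code&gt;l&lt;/code&gt; (logs automatically stream).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Press &lt;code&gt;d&lt;/code&gt; (describe): it reads like a man page, but instant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Press &lt;code&gt;:&lt;/code&gt; and type &lt;code&gt;deploy&lt;/code&gt; → select the deployment → press &lt;code&gt;s&lt;/code&gt; (scale) → type 0 → enter.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;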

&lt;p&gt;No typing resource names. No looking up pod hashes. No --tail. No namespace typos.&lt;/p&gt;

&lt;p&gt;That’s the difference between “managing YAML” and “piloting a cluster.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The Features That Stole My Heart
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Vim-like navigation (if you know, you know)&lt;br&gt;
j/k to scroll, / to filter, : for commands, ? for help. It feels like home for anyone who lives in the terminal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logs with infinite scroll and live tail&lt;br&gt;
Press l on any pod and watch logs update in real time. Press ctrl-s to save them to a file. Press esc to go back. No ctrl-c | kubectl logs dance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Popeye integration for cluster health&lt;br&gt;
K9s includes a built-in “Popeye” sanitizer. Press : then popeye and it scans your cluster for misconfigurations, deprecated APIs, and resource waste. It’s like a linter for your entire K8s setup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Command mode for everything&lt;br&gt;
Want to restart a deployment? Select it, press r. Port-forward? Select pod, press shift-f. Shell into a container? Select pod, press s. Everything is two keystrokes away.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Screens and skins&lt;br&gt;
You can save multiple screens (pods + logs + events side by side) and customize the color theme. I use a purple-and-cyan theme that makes CrashLoopBackOff stand out like a warning light.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Real Talk: Is It Perfect?
&lt;/h2&gt;

&lt;p&gt;No. Nothing is.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It’s not for scripting, obviously. You’ll still need kubectl in CI/CD.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Large clusters can lag: if you have 2,000 pods, initial load takes a few seconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The shortcut list is daunting at first. But after a week, you’ll be pressing ctrl-d to delete a pod without thinking.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here’s the thing: I don’t want it to replace kubectl. I want it to replace fatigue.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Start (in 60 seconds)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# macOS
brew install k9s

# Linux (via snap)
sudo snap install k9s

# Or download from GitHub releases
# https://github.com/derailed/k9s/releases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then just run:&lt;/p&gt;

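&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k9s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That’s it. It reads your ~/.kube/config. It respects your current context. It just works.&lt;/p&gt;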

&lt;h2&gt;
  
  
  The Moment I Knew I Was Hooked
&lt;/h2&gt;

&lt;p&gt;We had a production incident. Three microservices failing. A ConfigMap typo. I needed to check logs from service A, events from namespace B, and scale down service C simultaneously.&lt;/p&gt;

&lt;p&gt;Normally, I’d have five terminal tabs open, each with a different kubectl command running.&lt;/p&gt;

&lt;p&gt;With K9s, I opened two splits in the same TUI: logs on the right, pod list on the left. I watched the error appear, fixed the ConfigMap in another window, and saw the pods restart in real time without refreshing anything.&lt;/p&gt;

&lt;p&gt;My lead engineer looked over and said, “What the hell is that? Install it on my machine right now.”&lt;/p&gt;

&lt;p&gt;That’s K9s.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Verdict
&lt;/h2&gt;

&lt;p&gt;Kubernetes is powerful, but power without visibility is just chaos.&lt;/p&gt;

&lt;p&gt;K9s doesn’t make K8s easier — it makes it visible, tactile, and fast. It turns a firehose of YAML and API calls into a dashboard that fits inside a single terminal window.&lt;/p&gt;

&lt;p&gt;If you’re tired of typing kubectl get pods --all-namespaces | grep Pending, do yourself a favor.&lt;/p&gt;

&lt;p&gt;Try K9s for one day.&lt;/p&gt;

&lt;p&gt;You might just fall in love with K8s all over again.&lt;/p&gt;

&lt;p&gt;Bonus tip: Add alias kk='k9s' to your .zshrc and thank me later.&lt;/p&gt;

&lt;p&gt;Have you used K9s? What’s your favorite shortcut? Drop a comment below — I’ll trade you my skin config for yours. 🚀&lt;/p&gt;

</description>
      <category>cli</category>
      <category>kubernetes</category>
      <category>productivity</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Secrets Management Across Multi-Cloud Pipelines</title>
      <dc:creator>Nerav Doshi</dc:creator>
      <pubDate>Mon, 15 Jun 2026 14:51:40 +0000</pubDate>
      <link>https://dev.to/agenticdevops/secrets-management-across-multi-cloud-pipelines-13lf</link>
      <guid>https://dev.to/agenticdevops/secrets-management-across-multi-cloud-pipelines-13lf</guid>
      <description>&lt;p&gt;🛠️ &lt;strong&gt;Pipelines in the Wild #3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pipeline &amp;amp; Prompts | Byte size guides on DevOps, Cloud and AI&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ Byte Size Summary&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secret management failures are invisible until they cause a production incident — start with RBAC and namespace isolation before the first workload goes live&lt;/li&gt;
&lt;li&gt;Storing secrets in a central vault solves the sprawl problem but introduces a new failure mode: rotation lag between the vault and the namespace-level Kubernetes secret&lt;/li&gt;
&lt;li&gt;The real unsolved problem is not technical — it is knowing who owns the approval and escalation path when a credential rotates at 2 AM across a multi-timezone team&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;The deployment had been running fine in dev for two days. Same manifests, same pipeline, same container images. We promoted to production and the pods went straight into ImagePullBackOff.&lt;/p&gt;

&lt;p&gt;Not a misconfigured resource limit. Not a broken liveness probe. A pull secret that existed in the dev namespace and nowhere else.&lt;/p&gt;

&lt;p&gt;The registry was internal. The credential was real. Nobody had thought to check whether the secret had been created in the production namespace — because it had been created ad hoc during initial testing, stored on a local notepad, and everyone assumed someone else had handled it for prod.&lt;/p&gt;

&lt;p&gt;What followed was several hours of degraded production, a delayed platform release, and five or six people across multiple time zones working from memory and Slack threads with no runbook in sight. The fix, once identified, took minutes. Finding the fix took hours.&lt;/p&gt;

&lt;p&gt;That incident was the starting point of a long education in secret management. The immediate problem was a missing pull secret in the wrong namespace. The real problem ran deeper — and it took an audit, an enterprise approval process, a failed secret rotation, and one very sharp observation from a more experienced engineer to understand what it actually was.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;In the early stages of a Kubernetes adoption, secrets are almost always an afterthought. The team is focused on getting workloads running, learning the platform, and delivering against commitments. Secrets get created when something fails, stored wherever is convenient, and recreated from memory the next time something breaks.&lt;/p&gt;

&lt;p&gt;This works until it doesn't.&lt;/p&gt;

&lt;p&gt;The failure mode is not just operational — a wrong namespace, a stale credential, a missed rotation. The deeper failure is structural. Kubernetes base64 encoding is not encryption. Any service account with read access to a namespace can retrieve every secret in that namespace and decode the values in seconds. Without RBAC, dev service accounts can read prod database credentials. Without namespace isolation, a misconfigured workload in one environment can inadvertently consume secrets intended for another.&lt;/p&gt;
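
&lt;p&gt;To see how little protection base64 offers, the short Python sketch below decodes the &lt;code&gt;data&lt;/code&gt; map of a Secret using nothing but the standard library. The key name and value are made up for illustration; the point is that no key or passphrase is involved at any step.&lt;/p&gt;

```python
import base64


def decode_secret_data(data):
    """Decode the 'data' map of a Kubernetes Secret; no key is required."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Hypothetical Secret payload, shaped like the .data field of a Secret object.
example = {"DB_PASSWORD": base64.b64encode(b"s3cr3t-value").decode("ascii")}
```

&lt;p&gt;Calling &lt;code&gt;decode_secret_data(example)&lt;/code&gt; hands back the plaintext password, which is exactly what any identity with read access to Secrets in the namespace can do.&lt;/p&gt;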

&lt;p&gt;Platform engineers moving into multi-cloud environments compound this problem. Each cloud has its own native secrets service. Each pipeline has its own credential requirements. Each environment has its own namespace structure. Without a deliberate architecture, secrets sprawl across notepads, environment variables, ConfigMaps used as secret storage, and Git commits that are very hard to fully expunge once they are pushed.&lt;/p&gt;

&lt;p&gt;The incident cost was one day's delay on a significant platform release, discovered manually by a human checking on a deployment that had been quietly failing for hours. There was no alert. No monitor. No automated detection. Just someone who happened to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Existing Approaches Fall Short
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ad hoc secret creation per namespace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The natural first step. Create the secret where you need it, when you need it. Fast to start, impossible to maintain. Secrets diverge between environments, rotation becomes manual per namespace, and the source of truth is whoever created the secret last.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Secrets without RBAC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes Secrets are base64 encoded, and vanilla Kubernetes does not encrypt them at rest by default. OpenShift 4.x supports etcd encryption for Secrets, but it has to be enabled explicitly, and encryption at rest does nothing about API access: without tightly scoped RBAC, any service account with read access to Secrets in a namespace can still read every secret in that namespace. In a shared cluster with dev and prod namespaces side by side, this is not a theoretical risk; it is a standing exposure that an audit will find immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster separation as a security boundary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Separating prod and dev onto different clusters contains blast radius but does not fix the underlying problem. Ad hoc secrets still get created. Rotation is still manual. Tribal knowledge still owns the recovery path. The incident can no longer cross environments, but within each environment, the same exposure exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-native secrets managers without a sync strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Centralizing secrets in a cloud-native vault is the right architectural move. But it introduces a new failure mode that most documentation does not cover: the sync gap. When a secret rotates in the vault, the namespace-level Kubernetes &lt;code&gt;Secret&lt;/code&gt; object is a separate artifact. If the sync between vault and namespace fails — or if the pod is not restarted after a successful sync — the running workload is using a stale credential. The vault shows the rotation succeeded. The pod disagrees.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="/images/diagrams/secrets-management-multi-cloud-pipelines.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/diagrams/secrets-management-multi-cloud-pipelines.png" alt="Secret Management Architecture — Trust Boundaries and Sync Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above shows one thing: secret management is a routing problem with two distinct failure points, the trust boundary between namespaces and the sync gap between the central vault and the Kubernetes &lt;code&gt;Secret&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;The architecture has three layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Central Secrets Store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A cloud-native or self-hosted secrets manager holds the canonical value for every credential. Access to this layer is controlled by service account tokens scoped per environment. No developer has direct write access to production secrets in the central store. The CI/CD pipeline has read-only access, scoped to the secrets it needs for the environment it is deploying to. Human write access to prod secrets requires a break-glass process outside of automated rotation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Sync Operator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The External Secrets Operator (ESO) runs inside the cluster and watches for changes in the central store. When a rotation event occurs, ESO reconciles the namespace-level Kubernetes &lt;code&gt;Secret&lt;/code&gt; objects. This is the critical seam. If the operator fails, is misconfigured, or runs behind its refresh interval, the Kubernetes secret is stale even though the vault value is current. ESO must be monitored and alerted on — it is a critical path dependency, not background infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 — Namespace Isolation with RBAC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prod and dev namespaces are isolated with explicit RBAC. Service accounts are scoped to their namespace. The prod service account cannot read dev secrets. The dev service account cannot read prod secrets. This is enforced at the API server level, not by convention.&lt;/p&gt;

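&lt;p&gt;The rotation lag problem is architectural, not operational. A pod that consumes a secret through environment variables keeps the value injected at pod startup for its whole lifetime. A secret mounted as a volume is eventually refreshed by the kubelet (unless it is mounted with &lt;code&gt;subPath&lt;/code&gt;), but an application that read the file once at startup still holds the old credential. Restarting the pod after a confirmed sync is the simplest way to guarantee the running workload is using the current credential. Without a process that enforces this, rotation and running workload credential state are eventually consistent at best.&lt;/p&gt;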
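
&lt;p&gt;One way to reason about the two failure points is to compare fingerprints of the credential at each layer. The Python sketch below is a hypothetical helper, not part of ESO or any vault SDK: it assumes you can obtain the vault value, the synced Kubernetes &lt;code&gt;Secret&lt;/code&gt; value, and whatever the workload loaded at startup (or fingerprints of them), and it only classifies where a rotation is stuck.&lt;/p&gt;

```python
import hashlib


def fingerprint(value):
    """Short SHA-256 fingerprint so raw credentials never need to be logged."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def rotation_state(vault_value, k8s_secret_value, pod_loaded_value):
    """Classify where a rotation is stuck, if anywhere."""
    vault_fp = fingerprint(vault_value)
    secret_fp = fingerprint(k8s_secret_value)
    pod_fp = fingerprint(pod_loaded_value)
    if secret_fp != vault_fp:
        return "sync_gap"        # the operator has not reconciled the Secret yet
    if pod_fp != secret_fp:
        return "restart_needed"  # the Secret is current, the workload holds the old value
    return "in_sync"
```

&lt;p&gt;A result of &lt;code&gt;sync_gap&lt;/code&gt; points at the sync operator, while &lt;code&gt;restart_needed&lt;/code&gt; means the sync worked and the pod simply predates it. Those two cases call for different alerts and different owners.&lt;/p&gt;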




&lt;h2&gt;
  
  
  How It Works: Step by Step
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OpenShift 4.12+ or Kubernetes 1.26+&lt;/li&gt;
&lt;li&gt;Helm 3.x installed locally&lt;/li&gt;
&lt;li&gt;A central secrets manager — this article covers AWS Secrets Manager (IRSA via STS), Azure Key Vault (Workload Identity), and HashiCorp Vault (Kubernetes auth)&lt;/li&gt;
&lt;li&gt;Cluster-admin access to install the ESO operator and configure RBAC&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1 — Install the External Secrets Operator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add the External Secrets Operator Helm repository&lt;/span&gt;
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

&lt;span class="c"&gt;# Install ESO 0.10.0+ into its own namespace&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;external-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  external-secrets/external-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; external-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;installCRDs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 0.10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the operator is running before proceeding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc get pods &lt;span class="nt"&gt;-n&lt;/span&gt; external-secrets
&lt;span class="c"&gt;# All pods should show Running status before applying any SecretStore or ExternalSecret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
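
&lt;p&gt;The manifests in the following steps use &lt;code&gt;apiVersion: external-secrets.io/v1&lt;/code&gt;. As a quick sanity check, and assuming the chart installed its CRDs under the usual names, the commands below list the API versions that the installed &lt;code&gt;ExternalSecret&lt;/code&gt; and &lt;code&gt;SecretStore&lt;/code&gt; CRDs actually serve. If &lt;code&gt;v1&lt;/code&gt; is missing from the output, the installed chart predates it and the manifests will be rejected.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Print the served API versions of the ExternalSecret CRD&lt;/span&gt;
oc get crd externalsecrets.external-secrets.io \
  -o jsonpath='{.spec.versions[?(@.served==true)].name}{"\n"}'

&lt;span class="c"&gt;# Same check for the SecretStore CRD&lt;/span&gt;
oc get crd secretstores.external-secrets.io \
  -o jsonpath='{.spec.versions[?(@.served==true)].name}{"\n"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;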



&lt;h3&gt;
  
  
  Step 2 — Create a SecretStore scoped to each namespace
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;SecretStore&lt;/code&gt; is namespace-scoped. Prod and dev each get their own — they never share one. Choose the provider block that matches your environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS Secrets Manager — IRSA via STS
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prod-secretstore-aws.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-secretstore&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretsManager&lt;/span&gt;
      &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eu-west-1&lt;/span&gt;  &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — set your region&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;jwt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-workload-sa&lt;/span&gt;
            &lt;span class="c1"&gt;# This SA must carry the IAM role annotation — see Step 4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Annotate the service account with the IAM role ARN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc annotate serviceaccount prod-workload-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; prod &lt;span class="se"&gt;\&lt;/span&gt;
  eks.amazonaws.com/role-arn&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:iam::123456789012:role/prod-secrets-reader
  &lt;span class="c"&gt;# [AUTHOR TO VALIDATE] — replace account ID and role name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IAM role requires a trust policy scoped to the cluster OIDC provider and a permissions policy granting &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt; against specific secret ARNs — not &lt;code&gt;*&lt;/code&gt;.&lt;/p&gt;
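
&lt;p&gt;As a rough sketch of the permissions half of that requirement, the command below attaches an inline policy that allows &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt; on a single secret. The policy name &lt;code&gt;read-registry-pull-secret&lt;/code&gt; is hypothetical, the account ID and region are the placeholders used above, and the secret name &lt;code&gt;prod/registry/pull-secret&lt;/code&gt; is assumed to match the key used in Step 3. Secrets Manager appends a six-character random suffix to each secret ARN, which is why the resource ends in &lt;code&gt;-??????&lt;/code&gt;. The trust policy is not shown.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Attach a read-only inline policy scoped to one secret ARN (names are placeholders)&lt;/span&gt;
aws iam put-role-policy \
  --role-name prod-secrets-reader \
  --policy-name read-registry-pull-secret \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:prod/registry/pull-secret-??????"
      }
    ]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;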

&lt;h4&gt;
  
  
  Azure Key Vault — Workload Identity
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prod-secretstore-azure.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-secretstore&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;azurekv&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;authType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WorkloadIdentity&lt;/span&gt;
      &lt;span class="na"&gt;vaultUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&amp;lt;YOUR-KEYVAULT-NAME&amp;gt;.vault.azure.net"&lt;/span&gt;
      &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — replace with your Key Vault URL&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-workload-sa&lt;/span&gt;
        &lt;span class="c1"&gt;# This SA must carry the Workload Identity annotation — see Step 4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Annotate the service account with the managed identity client ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc annotate serviceaccount prod-workload-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; prod &lt;span class="se"&gt;\&lt;/span&gt;
  azure.workload.identity/client-id&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;MANAGED_IDENTITY_CLIENT_ID&amp;gt;
  &lt;span class="c"&gt;# [AUTHOR TO VALIDATE] — replace with your managed identity client ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The managed identity needs the &lt;code&gt;Key Vault Secrets User&lt;/code&gt; role scoped to the specific Key Vault — not the subscription. The pod spec also requires this label in the Deployment's pod template metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;azure.workload.identity/use&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
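
&lt;p&gt;For the role assignment described above, a minimal sketch with the Azure CLI might look like this. It assumes &lt;code&gt;KEYVAULT_NAME&lt;/code&gt; and &lt;code&gt;CLIENT_ID&lt;/code&gt; are shell variables you have already set (both are illustrative names, not part of the original setup) and that the Key Vault uses the Azure RBAC permission model rather than access policies.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Resolve the full resource ID of the Key Vault so the scope is the vault, not the subscription&lt;/span&gt;
KEYVAULT_ID="$(az keyvault show --name "$KEYVAULT_NAME" --query id --output tsv)"

&lt;span class="c"&gt;# Grant the managed identity read access to secrets in that one vault&lt;/span&gt;
az role assignment create \
  --assignee "$CLIENT_ID" \
  --role "Key Vault Secrets User" \
  --scope "$KEYVAULT_ID"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;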



&lt;h4&gt;
  
  
  HashiCorp Vault — Kubernetes Auth
&lt;/h4&gt;

&lt;p&gt;Kubernetes auth is the recommended starting point for Vault in an OpenShift environment. ESO requests a short-lived token for the referenced service account and presents it to Vault, so no static credentials are stored anywhere.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prod-secretstore-vault.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-secretstore&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;vault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://vault.internal:8200"&lt;/span&gt;
      &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — replace with your Vault server URL&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret"&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2"&lt;/span&gt;  &lt;span class="c1"&gt;# KV v2 is the current default secrets engine&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;kubernetes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubernetes"&lt;/span&gt;
          &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod-secret-reader"&lt;/span&gt;
          &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — replace with your Vault role name&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-workload-sa&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



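&lt;p&gt;Configure the Kubernetes auth backend on Vault once per cluster. When Vault runs outside the cluster, the config usually also needs &lt;code&gt;kubernetes_ca_cert&lt;/code&gt; and a &lt;code&gt;token_reviewer_jwt&lt;/code&gt; so that Vault can trust the API server and call its TokenReview endpoint:&lt;br&gt;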
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run against your Vault instance — not inside OpenShift&lt;/span&gt;
vault auth &lt;span class="nb"&gt;enable &lt;/span&gt;kubernetes

vault write auth/kubernetes/config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;kubernetes_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;OPENSHIFT_API_SERVER&amp;gt;:6443"&lt;/span&gt;
  &lt;span class="c"&gt;# [AUTHOR TO VALIDATE] — replace with your OpenShift API server URL&lt;/span&gt;

vault write auth/kubernetes/role/prod-secret-reader &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;bound_service_account_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prod-workload-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;bound_service_account_namespaces&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;policies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prod-secrets-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a minimal Vault policy scoped to the specific secret path — never use wildcards in prod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prod-secrets-policy.hcl&lt;/span&gt;
&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="s2"&gt;"secret/data/prod/registry/pull-secret"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
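
&lt;p&gt;The role created earlier references &lt;code&gt;prod-secrets-policy&lt;/code&gt; by name, so the policy file still has to be loaded into Vault under that name. A minimal sketch, assuming the file is saved as &lt;code&gt;prod-secrets-policy.hcl&lt;/code&gt; in the current directory and your Vault token is allowed to manage policies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Upload the policy under the name the Kubernetes auth role expects&lt;/span&gt;
vault policy write prod-secrets-policy prod-secrets-policy.hcl

&lt;span class="c"&gt;# Read it back to confirm the path and capabilities are what you intended&lt;/span&gt;
vault policy read prod-secrets-policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;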



&lt;p&gt;Apply the SecretStore manifest for your provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc apply &lt;span class="nt"&gt;-f&lt;/span&gt; prod-secretstore-aws.yaml    &lt;span class="c"&gt;# if using AWS&lt;/span&gt;
oc apply &lt;span class="nt"&gt;-f&lt;/span&gt; prod-secretstore-azure.yaml  &lt;span class="c"&gt;# if using Azure&lt;/span&gt;
oc apply &lt;span class="nt"&gt;-f&lt;/span&gt; prod-secretstore-vault.yaml  &lt;span class="c"&gt;# if using Vault&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
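
&lt;p&gt;Before moving on to the &lt;code&gt;ExternalSecret&lt;/code&gt;, it is worth checking that ESO could validate the store. The sketch below assumes the store name and namespace used above. A store that authenticated successfully typically reports READY as &lt;code&gt;True&lt;/code&gt;; if it does not, the conditions and events in the describe output usually name the failing step.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Summary view: check the STATUS and READY columns&lt;/span&gt;
oc get secretstore prod-secretstore -n prod

&lt;span class="c"&gt;# Detailed view: conditions and recent events for the store&lt;/span&gt;
oc describe secretstore prod-secretstore -n prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;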



&lt;h3&gt;
  
  
  Step 3 — Define an ExternalSecret to sync the pull secret
&lt;/h3&gt;

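&lt;p&gt;The &lt;code&gt;ExternalSecret&lt;/code&gt; fetches individual credential fields from the vault and assembles them into a valid &lt;code&gt;kubernetes.io/dockerconfigjson&lt;/code&gt; secret in the namespace. The template below works for all three providers. Because each SecretStore in Step 2 is named &lt;code&gt;prod-secretstore&lt;/code&gt;, the &lt;code&gt;secretStoreRef&lt;/code&gt; stays the same; what changes per provider is the &lt;code&gt;remoteRef.key&lt;/code&gt;, which must follow that provider's naming rules (Azure Key Vault secret names, for example, cannot contain slashes).&lt;br&gt;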
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prod-pull-secret-external.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry-pull-secret&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="c1"&gt;# Note: 1h means up to 60 minutes rotation lag before the&lt;/span&gt;
  &lt;span class="c1"&gt;# namespace Secret reflects a vault change. Reduce for&lt;/span&gt;
  &lt;span class="c1"&gt;# time-sensitive credentials. Minimum recommended: 15m.&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-secretstore&lt;/span&gt;   &lt;span class="c1"&gt;# matches whichever SecretStore you applied in Step 2&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry-pull-secret&lt;/span&gt;
    &lt;span class="na"&gt;creationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owner&lt;/span&gt;
    &lt;span class="c1"&gt;# Owner means ESO controls the lifecycle of this Secret.&lt;/span&gt;
    &lt;span class="c1"&gt;# If this ExternalSecret is deleted, the Secret is deleted with it.&lt;/span&gt;
    &lt;span class="c1"&gt;# Do not delete ExternalSecrets without understanding this behavior.&lt;/span&gt;
    &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/dockerconfigjson&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;.dockerconfigjson&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"auths": {&lt;/span&gt;
              &lt;span class="s"&gt;"{{ .registryHost }}": {&lt;/span&gt;
                &lt;span class="s"&gt;"username": "{{ .registryUsername }}",&lt;/span&gt;
                &lt;span class="s"&gt;"password": "{{ .registryPassword }}",&lt;/span&gt;
                &lt;span class="s"&gt;"auth": "{{ printf "%s:%s" .registryUsername .registryPassword | b64enc }}"&lt;/span&gt;
              &lt;span class="s"&gt;}&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;
          &lt;span class="s"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registryHost&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod/registry/pull-secret&lt;/span&gt;    &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — Vault path to your secret&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;                    &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — field name for registry hostname&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registryUsername&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod/registry/pull-secret&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;username&lt;/span&gt;                &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — field name for username&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registryPassword&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod/registry/pull-secret&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;                &lt;span class="c1"&gt;# [AUTHOR TO VALIDATE] — field name for password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc apply &lt;span class="nt"&gt;-f&lt;/span&gt; prod-pull-secret-external.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the sync completed and the Secret was created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc get externalsecret registry-pull-secret &lt;span class="nt"&gt;-n&lt;/span&gt; prod
&lt;span class="c"&gt;# STATUS column must show: SecretSynced&lt;/span&gt;
&lt;span class="c"&gt;# READY column must show: True&lt;/span&gt;

&lt;span class="c"&gt;# Confirm the Secret exists and is correctly typed&lt;/span&gt;
oc get secret registry-pull-secret &lt;span class="nt"&gt;-n&lt;/span&gt; prod &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.type}'&lt;/span&gt;
&lt;span class="c"&gt;# Expected output: kubernetes.io/dockerconfigjson&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
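
&lt;p&gt;The type check confirms the Secret exists, but not that the template rendered valid JSON. One way to check the payload without printing the credentials is to decode it and list only the registry hostnames. This is a sketch that assumes &lt;code&gt;jq&lt;/code&gt; is installed locally; a parse error from &lt;code&gt;jq&lt;/code&gt; suggests the template produced malformed JSON.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Decode .dockerconfigjson and print only the registry hosts under "auths"&lt;/span&gt;
oc get secret registry-pull-secret -n prod \
  -o jsonpath='{.data.\.dockerconfigjson}' \
  | base64 -d \
  | jq '.auths | keys'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;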



&lt;p&gt;If STATUS shows &lt;code&gt;SecretSyncedError&lt;/code&gt;, check the ESO operator logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc logs &lt;span class="nt"&gt;-n&lt;/span&gt; external-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;external-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4 — Apply RBAC to lock down namespace secret access
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prod-secret-rbac.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secret-reader&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secrets"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;resourceNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;registry-pull-secret"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Scoped to the named secret only — not wildcard access to all secrets&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoleBinding&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-secret-reader&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
&lt;span class="na"&gt;subjects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-workload-sa&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
&lt;span class="na"&gt;roleRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secret-reader&lt;/span&gt;
  &lt;span class="na"&gt;apiGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc apply &lt;span class="nt"&gt;-f&lt;/span&gt; prod-secret-rbac.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scopes the prod service account to read only the specific named secret it needs. Apply the equivalent for the dev namespace, scoped to dev secrets only. Neither service account should have cross-namespace access.&lt;/p&gt;
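
&lt;p&gt;The scoping can be checked by impersonating the service account. This is a sketch that assumes your own user is allowed to impersonate service accounts (cluster-admin is) and that no other binding grants &lt;code&gt;prod-workload-sa&lt;/code&gt; additional access; &lt;code&gt;some-other-secret&lt;/code&gt; is a made-up name used only for the negative check.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Should answer "yes": the named pull secret in prod&lt;/span&gt;
oc auth can-i get secret/registry-pull-secret -n prod \
  --as=system:serviceaccount:prod:prod-workload-sa

&lt;span class="c"&gt;# Should answer "no": any other secret in the same namespace&lt;/span&gt;
oc auth can-i get secret/some-other-secret -n prod \
  --as=system:serviceaccount:prod:prod-workload-sa

&lt;span class="c"&gt;# Should answer "no": secrets in the dev namespace&lt;/span&gt;
oc auth can-i get secrets -n dev \
  --as=system:serviceaccount:prod:prod-workload-sa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;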

&lt;h3&gt;
  
  
  Step 5 — Reference the secret in your workload
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prod-deployment.yaml (relevant section)&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;azure.workload.identity/use&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;  &lt;span class="c1"&gt;# include only if using Azure Workload Identity&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;imagePullSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry-pull-secret&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-workload-sa&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.internal/org/app:latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6 — Handle rotation explicitly
&lt;/h3&gt;

&lt;p&gt;When a credential rotates in the central store, the &lt;code&gt;ExternalSecret&lt;/code&gt; will re-sync within the &lt;code&gt;refreshInterval&lt;/code&gt;. The running pod will not automatically pick up the new credential — it uses the value that was mounted at startup. A rollout restart is required after every confirmed sync.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Confirm the sync has completed before restarting&lt;/span&gt;
oc get externalsecret registry-pull-secret &lt;span class="nt"&gt;-n&lt;/span&gt; prod
&lt;span class="c"&gt;# Confirm: STATUS = SecretSynced and READY = True&lt;/span&gt;

&lt;span class="c"&gt;# Restart the deployment to pick up the rotated credential&lt;/span&gt;
oc rollout restart deployment/app &lt;span class="nt"&gt;-n&lt;/span&gt; prod

&lt;span class="c"&gt;# Verify the rollout completes cleanly&lt;/span&gt;
oc rollout status deployment/app &lt;span class="nt"&gt;-n&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add this as an explicit named step in your rotation runbook — not a footnote. It is not optional and it is not automatic.&lt;/p&gt;
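
&lt;p&gt;For a runbook, waiting up to a full &lt;code&gt;refreshInterval&lt;/code&gt; is often not acceptable. One possible sketch is to ask ESO for an immediate reconcile and then wait on the result before restarting anything. The &lt;code&gt;force-sync&lt;/code&gt; annotation is the approach described in the ESO documentation; the two-minute timeout is an arbitrary choice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Change an annotation so ESO reconciles now instead of at the next refreshInterval&lt;/span&gt;
oc annotate externalsecret registry-pull-secret -n prod \
  force-sync="$(date +%s)" --overwrite

&lt;span class="c"&gt;# Block until the ExternalSecret reports Ready, or fail after two minutes&lt;/span&gt;
oc wait externalsecret/registry-pull-secret -n prod \
  --for=condition=Ready --timeout=120s

&lt;span class="c"&gt;# Print the last refresh time so it can be compared with the rotation time&lt;/span&gt;
oc get externalsecret registry-pull-secret -n prod \
  -o jsonpath='{.status.refreshTime}{"\n"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A Ready condition on its own can predate the rotation, so the refresh time is the more reliable signal that the new value has been pulled.&lt;/p&gt;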

&lt;p&gt;&lt;strong&gt;Rollback consideration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a rotation introduces a bad credential — wrong value, wrong format, access not yet propagated in the provider — roll back the deployment to the previous revision first, then investigate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;oc rollout undo deployment/app &lt;span class="nt"&gt;-n&lt;/span&gt; prod
oc rollout status deployment/app &lt;span class="nt"&gt;-n&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;oc rollout undo&lt;/code&gt; rolls back the deployment configuration, not the secret value. If the vault value itself is wrong, rolling back the deployment buys time but does not fix the underlying problem. Correct the value in the vault first, wait for ESO to re-sync, then trigger a new rollout. Do not attempt to fix the secret in place while the deployment is actively failing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security and Operational Considerations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RBAC is the first thing to configure, not the last&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes Secrets are base64-encoded, not encrypted. Any service account with unrestricted &lt;code&gt;get&lt;/code&gt; or &lt;code&gt;list&lt;/code&gt; access to secrets in a namespace can retrieve and decode every credential stored there. Neither OpenShift 4.x nor vanilla Kubernetes encrypts Secrets in etcd by default: on OpenShift, etcd encryption is switched on through the cluster &lt;code&gt;APIServer&lt;/code&gt; resource, and on vanilla Kubernetes through an &lt;code&gt;EncryptionConfiguration&lt;/code&gt; passed to the API server. Verify your cluster's encryption at rest configuration before assuming the storage layer is protected. Apply &lt;code&gt;Role&lt;/code&gt; and &lt;code&gt;RoleBinding&lt;/code&gt; before the first secret is created in any namespace, and scope them to named resources, not wildcard access.&lt;/p&gt;
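
&lt;p&gt;On OpenShift, the encryption setting can be read straight from the cluster configuration rather than assumed. A short sketch, assuming cluster-admin access: an empty result or &lt;code&gt;identity&lt;/code&gt; from the first command means etcd encryption is not enabled, while &lt;code&gt;aescbc&lt;/code&gt; or &lt;code&gt;aesgcm&lt;/code&gt; means it is configured. The second command reports whether the encryption of existing resources has finished.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Show the configured etcd encryption type for the cluster&lt;/span&gt;
oc get apiserver cluster -o jsonpath='{.spec.encryption.type}{"\n"}'

&lt;span class="c"&gt;# Show the status of the Encrypted condition on the OpenShift API server&lt;/span&gt;
oc get openshiftapiserver \
  -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;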

&lt;p&gt;&lt;strong&gt;The sync operator is a critical dependency — treat it as one&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once ESO is part of your architecture it is a critical path component. Monitor it. Alert on sync failures. ESO exposes the &lt;code&gt;externalsecret_sync_calls_error&lt;/code&gt; metric — wire this to your alerting platform. A silent sync failure means your workload is running with a stale credential and you will not know until something breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check ESO sync status across all ExternalSecrets in a namespace&lt;/span&gt;
oc get externalsecret &lt;span class="nt"&gt;-n&lt;/span&gt; prod
&lt;span class="c"&gt;# Any STATUS other than SecretSynced needs immediate investigation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
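
&lt;p&gt;The same check can be widened to the whole cluster and reduced to failures only, which is easier to wire into a scheduled job. This is a sketch that assumes &lt;code&gt;jq&lt;/code&gt; is available and that your user can list &lt;code&gt;ExternalSecret&lt;/code&gt; resources in every namespace. It prints one &lt;code&gt;namespace/name&lt;/code&gt; line per resource that has no Ready condition with status True.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List every ExternalSecret that is not currently Ready, across all namespaces&lt;/span&gt;
oc get externalsecret --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[]?; .type == "Ready" and .status == "True") | not)
      | "\(.metadata.namespace)/\(.metadata.name)"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;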



&lt;p&gt;&lt;strong&gt;The central secrets store itself needs RBAC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the engineering team has full read/write access to the secrets manager, the blast radius of a compromised account is the entire vault. Separate write access from read access. Human write access to prod secrets should require a break-glass process outside of automated rotation. Document who holds that access and review it quarterly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;creationPolicy: Owner&lt;/code&gt; has a destructive side effect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When ESO owns a Secret's lifecycle, deleting the &lt;code&gt;ExternalSecret&lt;/code&gt; deletes the Secret with it. In a multi-team environment, a developer deleting what appears to be a stale or misconfigured &lt;code&gt;ExternalSecret&lt;/code&gt; will drop the credential from the namespace immediately. Make sure your team understands this behavior before granting delete access to &lt;code&gt;ExternalSecret&lt;/code&gt; resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define the rotation approval path before you need it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the thing that documentation does not cover. When a credential rotates at 2 AM in a multi-cloud environment with a team spread across time zones, who has the authority to approve the rotation in the central store? Who runs the &lt;code&gt;oc rollout restart&lt;/code&gt;? Who confirms the rollout completed cleanly and signs off that prod is healthy?&lt;/p&gt;

&lt;p&gt;Write this down before it happens. Name the people, define the escalation path, and put it somewhere a new team member can find it without a Slack thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit logs need active review, not passive collection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most secrets managers generate audit logs for every read and write operation. These logs are only useful if someone is reviewing them. Wire secret access events into your SIEM or log aggregator and create alerts for anomalous patterns — unexpected reads, access from unrecognized service accounts, bulk secret reads that do not match a known pipeline run.&lt;/p&gt;
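&lt;p&gt;As a rough illustration of what such an alert rule checks, the sketch below flags reads from accounts outside a known list and accounts that read an unusually large number of distinct secrets. Everything about the event shape (&lt;code&gt;account&lt;/code&gt;, &lt;code&gt;action&lt;/code&gt;, &lt;code&gt;secret&lt;/code&gt;) is an assumption made for this example; real audit log formats differ per secrets manager, and in practice this logic usually lives in the SIEM rather than in a script.&lt;/p&gt;

```python
def flag_anomalies(events, known_accounts, bulk_threshold):
    """Return alert strings for unknown readers and for bulk reads.

    `events` is a list of dicts with hypothetical keys: account, action, secret.
    """
    alerts = []
    distinct_reads = {}
    for event in events:
        if event["action"] != "read":
            continue
        account = event["account"]
        if account not in known_accounts:
            alerts.append(f"unrecognized account {account} read {event['secret']}")
        distinct_reads.setdefault(account, set()).add(event["secret"])
    for account, secrets in sorted(distinct_reads.items()):
        if len(secrets) > bulk_threshold:
            alerts.append(f"bulk read by {account}: {len(secrets)} secrets")
    return alerts

# Hypothetical events, for illustration only.
events = [
    {"account": "ci-pipeline", "action": "read", "secret": "db-creds"},
    {"account": "ci-pipeline", "action": "write", "secret": "db-creds"},
    {"account": "debug-sa", "action": "read", "secret": "db-creds"},
    {"account": "debug-sa", "action": "read", "secret": "api-key"},
    {"account": "debug-sa", "action": "read", "secret": "tls-cert"},
]

for alert in flag_anomalies(events, known_accounts={"ci-pipeline"}, bulk_threshold=2):
    print(alert)
```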




&lt;h2&gt;
  
  
  What Breaks at Scale
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rotation lag multiplies across namespaces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With one namespace and one workload, a manual &lt;code&gt;oc rollout restart&lt;/code&gt; after rotation is manageable. With ten namespaces, thirty deployments, and a rotation event that cascades across dependent credentials, it does not scale. You need a rotation event handler — a pipeline step or operator webhook that triggers a rolling restart of affected workloads automatically after a confirmed sync. This is not a day-one problem. It becomes one at day ninety when the first coordinated rotation happens and nobody has automated the downstream restart.&lt;/p&gt;
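&lt;p&gt;A rotation event handler can start very small. The sketch below is one possible shape under stated assumptions: the &lt;code&gt;DEPENDENTS&lt;/code&gt; mapping from secret name to consuming deployments is hypothetical and would come from your own inventory, and the function only builds the &lt;code&gt;oc rollout restart&lt;/code&gt; commands so the pipeline step that confirmed the sync can review them or hand them to &lt;code&gt;subprocess.run&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical inventory: secret name mapped to (namespace, deployment) pairs that consume it.
DEPENDENTS = {
    "db-creds": [("payments", "api"), ("payments", "worker"), ("billing", "invoicer")],
    "api-key": [("gateway", "edge-proxy")],
}

def restart_commands(rotated_secret, dependents):
    """Build one `oc rollout restart` command per workload that uses the secret."""
    commands = []
    for namespace, deployment in dependents.get(rotated_secret, []):
        commands.append(
            ["oc", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]
        )
    return commands

for cmd in restart_commands("db-creds", DEPENDENTS):
    print(" ".join(cmd))
```

&lt;p&gt;Keeping command construction separate from execution makes the handler easy to dry-run, which matters when the first real use is a coordinated rotation across many namespaces.&lt;/p&gt;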

&lt;p&gt;&lt;strong&gt;Cross-cloud secret identity is unsolved by most teams&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a true multi-cloud deployment, with workloads on AWS, Azure, and an on-premises OpenShift cluster all consuming secrets, each environment has its own identity model for authenticating to the central store. The pipeline service account on AWS uses an IAM role. The OpenShift cluster on-premises uses a service account token projected via OIDC. Keeping these identity bindings consistent, rotated, and auditable across all three environments is an operational challenge that most tooling handles partially at best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 2 AM problem at scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With one team and one cluster, relying on Slack and tribal knowledge is expensive but survivable. With multiple teams, multiple clusters, and a secrets manager that is a shared dependency, a rotation failure at 2 AM is a cross-team incident. The human routing problem (who owns the approval, who runs the restart, who confirms health across environments) does not get easier with scale. It gets harder. The runbook is not optional at this point. It is the difference between a thirty-minute recovery and a three-hour incident bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulated environments add approval gates to the rotation path&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In financial services or healthcare environments, credential rotation often requires a change approval before the rotation runs, not just after. This means the automated rotation flow needs to integrate with your change management tooling — a ServiceNow ticket, a Jira issue, an approval gate in the pipeline. The technical implementation is straightforward. Getting it through the approval process for a new tooling integration is the actual work.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Start with encrypted Git secrets before the first workload enters a namespace. Not as the end state — as the minimum bar that establishes the habit. Leaked Git history is incredibly difficult to clean completely. An encrypted Git secret is easy to upgrade to an enterprise vault later. And it builds a security-first mindset within the engineering team from day one, before there is an incident to justify it.&lt;/p&gt;

&lt;p&gt;The harder lesson: define the rotation runbook before the first secret is created in prod, not after the first rotation failure. The technical architecture is the easy part. Knowing who clicks approve at 2 AM is what breaks in production — and no documentation covers it because it is a people and process problem, not a Kubernetes problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RBAC first, secrets second&lt;/strong&gt; — configure namespace-level RBAC before the first secret is created; base64 encoding is not access control, and etcd encryption at rest is not enabled by default on vanilla Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The sync gap is the rotation failure&lt;/strong&gt; — a successful rotation in your central vault does not mean running pods are using the new credential; an explicit rollout restart after a confirmed ESO sync is required and must be in the runbook&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret management is a human routing problem&lt;/strong&gt; — the technical architecture is solvable; who owns the 2 AM approval and the cross-timezone escalation path is what breaks in production&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  GitHub Repo
&lt;/h2&gt;

&lt;p&gt;Full implementation with working manifests for all three providers, RBAC templates, and rotation runbook:&lt;/p&gt;

&lt;p&gt;Repository content in progress: pipelineandprompts-labs/secrets-management-multi-cloud&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Secret management is one half of the pipeline security conversation. The other half is what happens when the pipeline itself is the attack surface — supply chain security, signed commits, and verifying that the image running in prod is exactly the image that passed your tests.&lt;/p&gt;

&lt;p&gt;Next in Pipelines in the Wild: &lt;strong&gt;Pipeline Supply Chain Security — Signing, Provenance, and Why Your CI/CD Pipeline is a Target.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Pipeline &amp;amp; Prompts | Byte size guides on DevOps, Cloud and AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Found this useful? Share it with the engineer on your team who is still creating secrets manually — and forward it to whoever owns the rotation runbook. If there is no rotation runbook, this article is for them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>secretsmanagement</category>
      <category>openshift</category>
      <category>kubernetes</category>
      <category>pipelinesinthewild</category>
    </item>
    <item>
      <title>MCP Server Architecture for Platform Teams — Giving AI Live Access to Your Infrastructure</title>
      <dc:creator>Nerav Doshi</dc:creator>
      <pubDate>Mon, 15 Jun 2026 13:43:44 +0000</pubDate>
      <link>https://dev.to/agenticdevops/mcp-server-architecture-for-platform-teams-giving-ai-live-access-to-your-infrastructure-3n76</link>
      <guid>https://dev.to/agenticdevops/mcp-server-architecture-for-platform-teams-giving-ai-live-access-to-your-infrastructure-3n76</guid>
      <description>&lt;p&gt;&lt;em&gt;Pipeline &amp;amp; Prompts | Byte size guides on DevOps, Cloud and AI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI in the Stack #3&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ Byte Size Summary&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP (Model Context Protocol) is the standard that lets AI agents interact with external systems — your cluster, your observability stack, your ticketing system — without bespoke integration code for every tool.&lt;/li&gt;
&lt;li&gt;MCP directly addresses AI hallucination and 2AM incident response by grounding AI answers in live system state. It does not solve tribal knowledge alone — that needs RAG alongside it.&lt;/li&gt;
&lt;li&gt;This article covers the production-grade architecture: what MCP servers are, how to design them for platform engineering use cases, and what you need to get right before running them anywhere near production.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;p&gt;In logistics, the hardest problems rarely come from missing data.&lt;/p&gt;

&lt;p&gt;They come from disconnected systems.&lt;/p&gt;

&lt;p&gt;The warehouse knows one thing. The transportation management system knows another. Inventory systems lag behind reality by hours. Operators work around the gaps manually — copying numbers between screens, making calls to confirm what the system should already know, carrying context in their heads because no single system has the full picture.&lt;/p&gt;

&lt;p&gt;I spent years watching intelligent people solve problems that should not have existed, because the systems around them were designed to optimise locally rather than coordinate globally. The data was there. The capability was there. The coordination layer was not.&lt;/p&gt;

&lt;p&gt;Modern infrastructure operations feel surprisingly similar.&lt;/p&gt;

&lt;p&gt;Your Kubernetes cluster knows the state of every pod. Your observability stack knows the error rates and latency trends. Your ticketing system knows what changes were deployed in the last 24 hours. Your CI/CD pipeline knows what is currently in flight. And your AI assistant — the tool you are increasingly asking to help you reason about incidents — knows none of it, unless you paste it in manually.&lt;/p&gt;

&lt;p&gt;Model Context Protocol is the coordination layer that changes this. Not by giving AI access to everything at once, but by giving it a structured, auditable, controlled way to request the context it needs, from the systems that have it, at the moment it needs it.&lt;/p&gt;

&lt;p&gt;That is what this article is about.&lt;/p&gt;




&lt;h2&gt;
  
  
  What MCP Actually Is
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol (MCP) is an open standard, introduced by Anthropic, that defines how AI models communicate with external tools and data sources. Think of it as a common language that sits between an AI assistant and the systems it needs to interact with.&lt;/p&gt;

&lt;p&gt;Before MCP, every AI integration was bespoke. You wanted your LLM to query your Kubernetes cluster? Write a custom function. You wanted it to check PagerDuty? Write another one. You wanted it to search your runbooks and open a Jira ticket? Three separate integrations, all maintained independently, all breaking in different ways when APIs change.&lt;/p&gt;

&lt;p&gt;MCP replaces that with a standard. An MCP server exposes a set of &lt;strong&gt;tools&lt;/strong&gt; — defined capabilities the AI can invoke — plus &lt;strong&gt;resources&lt;/strong&gt; — data it can read. The AI client (Claude, Cursor, any MCP-compatible host) discovers what tools are available, decides which to call based on the user's question, calls them, and incorporates the results into its response.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/diagrams/mcp-server-flow.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/diagrams/mcp-server-flow.png" alt="Platform MCP Server Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AI does not have direct access to your systems. It has access to an MCP server that mediates that access. That distinction matters enormously for security and governance — which is why this article spends as much time on architecture as on implementation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Platform Engineers Should Care
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline from &lt;a href="https://dev.to/posts/cicd-pipelines-code-to-realworld/"&gt;Article 02&lt;/a&gt; was useful for static knowledge — runbooks, documentation, past incident reports. MCP is useful for live state.&lt;/p&gt;

&lt;p&gt;When an engineer asks "what is causing the latency spike in the payments service right now?" — that is not a runbook question. It requires current pod status, recent deployment events, live error rates, and possibly the last three alerts that fired. None of that lives in a document. All of it lives in systems your MCP server can reach.&lt;/p&gt;

&lt;p&gt;The distinction between what MCP solves and what it does not matters before you design anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI hallucination — yes, directly.&lt;/strong&gt; Hallucination happens when an LLM answers from training data instead of ground truth. MCP forces the AI to retrieve live, authoritative state before responding. It does not eliminate hallucination entirely — an LLM can still misinterpret what it retrieves — but it directly attacks the root cause for infrastructure questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2AM incidents — yes, directly.&lt;/strong&gt; This is the primary operational use case. Instead of an engineer manually checking five systems in sequence while half-asleep, an AI with MCP access can pull pod status, recent events, and active alerts in a single query and reason across all of it simultaneously. Speed and context at the moment they are hardest to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too many dashboards — partially.&lt;/strong&gt; MCP does not reduce the number of dashboards in your environment. It gives an AI a way to query across the systems those dashboards represent, so an engineer asks one question instead of navigating five screens. The dashboards still exist. You stop having to drive them manually during an incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tribal knowledge — not alone.&lt;/strong&gt; MCP surfaces what your systems know. It does not surface what your team knows — the undocumented context that lives in people's heads, the runbook that exists nowhere in any system, the reason a service is named what it is. That is a RAG problem. The combination of RAG (for historical and human knowledge) and MCP (for live system state) is where the tribal knowledge gap actually starts to close. Neither alone is sufficient.&lt;/p&gt;

&lt;p&gt;An AI that can read your runbooks and query your cluster simultaneously is a meaningful operational tool. An AI that can only do one of those things is a limited one.&lt;/p&gt;




&lt;h2&gt;
  
  
  MCP Server Architecture for Platform Engineering
&lt;/h2&gt;

&lt;p&gt;A production-grade MCP server for a platform team has four layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/diagrams/mcp-server-architecture-platform-engineering-kubernetes.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/diagrams/mcp-server-architecture-platform-engineering-kubernetes.png" alt="Platform MCP Server Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every tool invocation travels this path: the AI client sends a request, the Auth Gateway validates identity before anything reaches your infrastructure, the MCP server processes it through governance and audit controls, and the Kubernetes API Server enforces access policy independently of the application layer. Two enforcement gates — not one. That is the architecture the implementation sections below are built around.&lt;/p&gt;

&lt;p&gt;The four layers in code:&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1 — Governance First
&lt;/h2&gt;

&lt;p&gt;Before writing a single tool definition, decide and enforce these three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read-only by default.&lt;/strong&gt; Every tool that touches production infrastructure should be read-only unless you have explicitly designed the write path with human approval steps. An MCP server that can &lt;code&gt;kubectl delete&lt;/code&gt; anything is an incident waiting to happen. Start with read, earn trust, expand deliberately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit logging.&lt;/strong&gt; Every tool call should be logged with: timestamp, tool name, input parameters, calling session identity, and response status. This is your audit trail when something goes wrong. It is also how you demonstrate to your security team that AI is not a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting.&lt;/strong&gt; An AI in an agentic loop can call tools hundreds of times in seconds. Without rate limiting, a runaway agent can saturate your Kubernetes API server, spam your ticketing system, or trigger alert storms in your observability stack. Set per-session and per-tool limits before you deploy.&lt;/p&gt;
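&lt;p&gt;One way to make these three rules concrete is a single wrapper that every tool call passes through. The sketch below is illustrative only and is not tied to any MCP SDK: &lt;code&gt;GovernedTools&lt;/code&gt;, &lt;code&gt;READ_ONLY_TOOLS&lt;/code&gt;, and the fixed per-session call budget are hypothetical names and choices, and a production limiter would usually be time-windowed rather than a simple count.&lt;/p&gt;

```python
import json
import time
from collections import Counter

# Hypothetical allowlist: only read-only tools may be invoked.
READ_ONLY_TOOLS = {"get_pod_status", "list_failing_pods", "get_recent_events"}

class GovernedTools:
    def __init__(self, handlers, calls_per_session=3):
        self.handlers = handlers              # tool name mapped to a callable
        self.calls_per_session = calls_per_session
        self.used = Counter()                 # (session_id, tool) mapped to calls made
        self.audit_log = []                   # in-memory stand-in for a real log sink

    def call(self, session_id, tool, params):
        status, result = "ok", None
        if tool not in READ_ONLY_TOOLS:
            status = "denied:not_read_only"
        elif self.used[(session_id, tool)] == self.calls_per_session:
            status = "denied:rate_limited"
        else:
            self.used[(session_id, tool)] += 1
            try:
                result = self.handlers[tool](**params)
            except Exception as exc:
                status = f"error:{type(exc).__name__}"
        # Record every attempt, including denied ones.
        self.audit_log.append({
            "timestamp": time.time(),
            "tool": tool,
            "params": params,
            "session": session_id,
            "status": status,
        })
        return status, result

# Stub handler standing in for a real backend client.
tools = GovernedTools({"get_pod_status": lambda namespace, pod_name: {"phase": "Running"}})
args = {"namespace": "prod", "pod_name": "api-0"}
for _ in range(4):
    status, _result = tools.call("sess-1", "get_pod_status", args)
print(status)
print(tools.call("sess-1", "delete_pod", args)[0])
print(json.dumps(tools.audit_log[-1], indent=2))
```

&lt;p&gt;Denied calls are logged alongside successful ones, so the audit trail shows what the agent attempted as well as what it was allowed to do.&lt;/p&gt;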




&lt;h2&gt;
  
  
  Layer 2 — Backend Clients
&lt;/h2&gt;

&lt;p&gt;The MCP server needs clients for each system it connects to. Keep these thin — their job is to call APIs and return structured data, not to contain business logic.&lt;/p&gt;

&lt;p&gt;For a Kubernetes-connected MCP server, using the official &lt;code&gt;kubernetes&lt;/code&gt; Python client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# k8s_client.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KubernetesClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_cluster&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;in_cluster&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_incluster_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_kube_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apps_v1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AppsV1Api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_namespaced_pod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conditions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conditions&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container_statuses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restart_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;restart_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cs&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;container_statuses&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_failing_pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_namespaced_pod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_pod_for_all_namespaces&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;failing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Phase-only check: a pod stuck in CrashLoopBackOff usually still reports Running and is not caught here.&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Succeeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;failing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;failing&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_recent_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
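        &lt;span class="c1"&gt;# Caveat: limit truncates the list server-side, and the API does not return events in time order,&lt;/span&gt;
        &lt;span class="c1"&gt;# so the sort below may not see the newest events in a busy namespace.&lt;/span&gt;
        &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_namespaced_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;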
            &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;involved_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;involved_object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
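                &lt;span class="c1"&gt;# Fall back to a datetime, not "", so mixed values stay comparable
&lt;/span&gt;                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_timestamp&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;creation_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;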
                &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
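
&lt;p&gt;One caveat with &lt;code&gt;get_recent_events&lt;/code&gt;: as far as I understand the list API, &lt;code&gt;limit&lt;/code&gt; is applied by the API server before the client-side sort runs, so the result is the newest of one page rather than the newest in the namespace. Below is a minimal sketch of sorting first and truncating afterwards. Plain dicts stand in for the event objects, and &lt;code&gt;newest_first&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and the &lt;code&gt;created&lt;/code&gt; key are hypothetical names used only for illustration.&lt;/p&gt;

```python
from datetime import datetime, timezone


def newest_first(events: list[dict], limit: int = 20) -> list[dict]:
    """Sort by last_timestamp (falling back to created), then truncate."""
    ordered = sorted(
        events,
        # Both candidates are datetimes, so the comparison never mixes types
        key=lambda e: e.get("last_timestamp") or e["created"],
        reverse=True,
    )
    return ordered[:limit]


def ts(hour: int) -> datetime:
    """Build a timezone-aware timestamp on an arbitrary fixed day."""
    return datetime(2026, 1, 1, hour, tzinfo=timezone.utc)


# Illustrative data only; one event has no last_timestamp
EVENTS = [
    {"reason": "Scheduled", "last_timestamp": ts(9), "created": ts(9)},
    {"reason": "BackOff", "last_timestamp": None, "created": ts(11)},
    {"reason": "Pulled", "last_timestamp": ts(10), "created": ts(10)},
]

if __name__ == "__main__":
    for event in newest_first(EVENTS, limit=2):
        print(event["reason"])
```

&lt;p&gt;In a large namespace the trade-off is real: sorting client-side means fetching every event first, so a field selector or a shorter event TTL may be the better lever.&lt;/p&gt;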






&lt;h2&gt;
  
  
  Layer 3 — Tool Definitions
&lt;/h2&gt;

&lt;p&gt;This is the layer the AI interacts with directly. Tool descriptions are more than documentation: the LLM reads them to decide whether to call a tool and how to format its inputs, so write them precisely.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tools.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TextContent&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;k8s_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KubernetesClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;audit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;log_tool_call&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;k8s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KubernetesClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Set True when running inside the cluster
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="nd"&gt;@server.list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_pod_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get the current status of a specific Kubernetes pod, including phase, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readiness conditions, container states, and restart counts. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use this when investigating why a specific pod is unhealthy or not ready.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Kubernetes namespace the pod is in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The exact name of the pod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_failing_pods&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List all pods that are not in Running or Succeeded state across the cluster &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;or within a specific namespace. Use this as a first step when an incident &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is reported and you need to identify which pods are affected.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Optional: filter to a specific namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_recent_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve recent Kubernetes events for a namespace, ordered by most recent first. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Events capture warnings, errors, and state changes. Use this to understand &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what happened in the cluster leading up to an issue.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The namespace to retrieve events from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maximum number of events to return (default 20)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
                        &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nd"&gt;@server.call_tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;log_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Always audit first
&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_pod_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_pod_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_failing_pods&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_failing_pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_recent_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_recent_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# logger.exception records the traceback as well as the message
&lt;/span&gt;            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool %s failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool execution failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
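
&lt;p&gt;The dispatcher above indexes &lt;code&gt;arguments&lt;/code&gt; directly, so a missing field only surfaces as a generic "Tool execution failed" message. A minimal sketch of checking the schema's &lt;code&gt;required&lt;/code&gt; list before dispatch follows. The helper name &lt;code&gt;missing_required&lt;/code&gt; and the constant &lt;code&gt;POD_STATUS_SCHEMA&lt;/code&gt; are hypothetical, and this checks presence only, not types.&lt;/p&gt;

```python
def missing_required(input_schema: dict, arguments: dict) -> list[str]:
    """Return the names of required properties absent from arguments."""
    required = input_schema.get("required", [])
    return [key for key in required if key not in arguments]


# Trimmed copy of the get_pod_status inputSchema from the tool definition
POD_STATUS_SCHEMA = {
    "type": "object",
    "properties": {
        "namespace": {"type": "string"},
        "pod_name": {"type": "string"},
    },
    "required": ["namespace", "pod_name"],
}

if __name__ == "__main__":
    missing = missing_required(POD_STATUS_SCHEMA, {"namespace": "payments"})
    if missing:
        # A named field gives the LLM something concrete to correct
        print("Missing required arguments: " + ", ".join(missing))
```

&lt;p&gt;Returning the missing field names as the tool result gives the model a specific error to act on when it retries the call.&lt;/p&gt;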






&lt;h2&gt;
  
  
  Layer 4 — Transport and Auth
&lt;/h2&gt;

&lt;p&gt;MCP supports two transport modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stdio&lt;/strong&gt; — the server runs as a subprocess of the AI client. Simple, local, no network exposure. Right for developer workstations and local tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP with SSE (Server-Sent Events)&lt;/strong&gt; — the server runs as a persistent service, reachable over the network. Required for shared team tooling, remote access, and running inside a cluster. For production deployments, SSE transport with mutual TLS (mTLS) is the hardened path; API key authentication is acceptable for internal cluster traffic with network policy controls in place.&lt;/p&gt;

&lt;p&gt;For a platform team MCP server running on Kubernetes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.sse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SseServerTransport&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.applications&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Starlette&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.routing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.middleware&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Middleware&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.middleware.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseHTTPMiddleware&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JSONResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_tools&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;register_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the key from the environment, never hardcode it.
&lt;/span&gt;&lt;span class="c1"&gt;# MCP_API_KEY is an assumed variable name; use whatever your deployment injects.
&lt;/span&gt;&lt;span class="n"&gt;EXPECTED_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MCP_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;APIKeyMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseHTTPMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Constant-time comparison avoids leaking key contents through timing
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare_digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;EXPECTED_API_KEY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unauthorised&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SseServerTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_sse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect_sse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_send&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_initialization_options&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Starlette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/sse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;handle_sse&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;middleware&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;APIKeyMiddleware&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Kubernetes Deployment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# k8s/deployment.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-server&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-tools&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-server&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-server&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-sa&lt;/span&gt;  &lt;span class="c1"&gt;# Read-only SA — see RBAC below&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mcp-server&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-registry/platform-mcp:latest&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MCP_API_KEY&lt;/span&gt;
              &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-secrets&lt;/span&gt;
                  &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-key&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# k8s/rbac.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-reader&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;namespaces"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nodes"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Read-only — no create, update, delete&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apps"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployments"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replicasets"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRoleBinding&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-reader-binding&lt;/span&gt;
&lt;span class="na"&gt;subjects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-sa&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-tools&lt;/span&gt;
&lt;span class="na"&gt;roleRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform-mcp-reader&lt;/span&gt;
  &lt;span class="na"&gt;apiGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RBAC configuration enforces the governance constraint at the Kubernetes level — not just in application code. Even if a bug in the tool definitions allowed a write operation to reach the Kubernetes client, the service account has no permission to execute it.&lt;/p&gt;

&lt;p&gt;Defence in depth. Not one gate — two.&lt;/p&gt;
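
&lt;p&gt;One way to keep that second gate from eroding is a small check that fails as soon as a write verb slips into the role. The following is a minimal sketch, assuming the ClusterRole rules have already been parsed into plain Python dicts (for example by a YAML loader); the name &lt;code&gt;find_write_verbs&lt;/code&gt; and the verb set are illustrative and not part of the server described here.&lt;/p&gt;

```python
# Verbs that only read state; anything else (including "*") is flagged.
READ_ONLY_VERBS = {"get", "list", "watch"}


def find_write_verbs(rules):
    """Return (resource, verb) pairs that grant more than read access."""
    violations = []
    for rule in rules:
        for verb in rule.get("verbs", []):
            if verb not in READ_ONLY_VERBS:
                for resource in rule.get("resources", []):
                    violations.append((resource, verb))
    return violations


# The two rules from the platform-mcp-reader ClusterRole above.
reader_rules = [
    {"apiGroups": [""], "resources": ["pods", "events", "namespaces", "nodes"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments", "replicasets"],
     "verbs": ["get", "list", "watch"]},
]

print(find_write_verbs(reader_rules))
```

&lt;p&gt;Run as a unit test in CI, a check like this would flag a pull request that adds &lt;code&gt;delete&lt;/code&gt; or &lt;code&gt;patch&lt;/code&gt; to the role before it reaches the cluster.&lt;/p&gt;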




&lt;h2&gt;
  
  
  What This Unlocks
&lt;/h2&gt;

&lt;p&gt;With a platform MCP server running, a Claude-powered assistant can handle questions like these using live cluster data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"What pods are failing in the payments namespace right now?"&lt;/em&gt; → calls &lt;code&gt;list_failing_pods&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Why did the checkout service restart three times this morning?"&lt;/em&gt; → calls &lt;code&gt;get_pod_status&lt;/code&gt; + &lt;code&gt;get_recent_events&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Is there anything unusual happening across the cluster before I deploy?"&lt;/em&gt; → calls &lt;code&gt;list_failing_pods&lt;/code&gt; across all namespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the coordination layer the opening story was pointing at. In logistics, the fix for disconnected systems was never better dashboards — it was a shared integration layer that let every system speak to every other system through a common protocol. MCP is that layer for AI and infrastructure.&lt;/p&gt;

&lt;p&gt;Combined with the RAG pipeline from Article 02, the same assistant can cross-reference live cluster state against your runbooks — returning answers grounded in documentation and informed by current reality simultaneously. That is the operational use case MCP was built for.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Build Next
&lt;/h2&gt;

&lt;p&gt;The server in this article covers Kubernetes read operations. The natural extensions, covered in the &lt;a href="https://github.com/agentic-devops/pipelineandprompts-labs/tree/main/mcp-for-kubernetes" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;, are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus integration&lt;/strong&gt; — add a &lt;code&gt;get_metrics&lt;/code&gt; tool that queries PromQL (Prometheus Query Language) and returns current error rates and latency percentiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagerDuty integration&lt;/strong&gt; — add &lt;code&gt;get_active_incidents&lt;/code&gt; and &lt;code&gt;get_recent_alerts&lt;/code&gt; tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write operations with human approval&lt;/strong&gt; — a &lt;code&gt;restart_pod&lt;/code&gt; tool that creates a Jira ticket and waits for human sign-off before executing; this is the governance pattern that makes agentic write operations safe in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The write operation pattern — where the AI prepares an action, a human approves it, and the MCP server executes — is covered in Article 05 of this series.&lt;/p&gt;
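
&lt;p&gt;The shape of that pattern can be sketched in a few lines. This is a hypothetical illustration only, not the implementation from the series: &lt;code&gt;PendingAction&lt;/code&gt;, &lt;code&gt;approve&lt;/code&gt;, and &lt;code&gt;execute&lt;/code&gt; are assumed names, and the ticketing integration is left out entirely.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class PendingAction:
    """A write operation the assistant has prepared but not executed."""
    description: str
    approved_by: str = ""  # stays empty until a human signs off


def approve(action, reviewer):
    """Record the human sign-off on a prepared action."""
    action.approved_by = reviewer


def execute(action, run):
    """Call run() only if the action carries an approval."""
    if not action.approved_by:
        raise PermissionError("action has not been approved: " + action.description)
    return run()
```

&lt;p&gt;The point of the design is that the executing function refuses by default: the assistant can only create a &lt;code&gt;PendingAction&lt;/code&gt;, and nothing runs until a reviewer has been recorded on it.&lt;/p&gt;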




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Article 04 — Prompt Versioning in Production: Treat Prompts Like Infrastructure Artifacts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;System prompts are configuration. Changing them without version control, testing, or rollback strategy is the same mistake engineers made with infrastructure before Terraform existed. Next: how to version, test, and deploy prompts with the same discipline you apply to everything else in your stack.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aiinthestack</category>
    </item>
    <item>
      <title>How Do You Integrate Penetration Testing into CI/CD?</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Mon, 15 Jun 2026 12:50:55 +0000</pubDate>
      <link>https://dev.to/varunvarde/how-do-you-integrate-penetration-testing-into-cicd-1795</link>
      <guid>https://dev.to/varunvarde/how-do-you-integrate-penetration-testing-into-cicd-1795</guid>
      <description>&lt;p&gt;Modern software delivery pipelines can deploy code dozens or even hundreds of times per day. Traditional penetration testing models, where security teams perform assessments quarterly or before major releases, simply cannot keep pace.&lt;/p&gt;

&lt;p&gt;Attackers do not wait for the next security review.&lt;/p&gt;

&lt;p&gt;Every pull request, dependency update, infrastructure change, or container image introduces potential risk. Integrating penetration testing into CI/CD enables organizations to identify vulnerabilities before they reach production.&lt;/p&gt;

&lt;p&gt;The goal is not replacing human penetration testers. The goal is automating everything that can be automated so security experts can focus on complex attack paths and business logic flaws.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Security Testing Layers in CI/CD
&lt;/h2&gt;

&lt;p&gt;Security testing is often misunderstood because multiple categories overlap.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Testing Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SAST&lt;/td&gt;
&lt;td&gt;Analyze source code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SCA&lt;/td&gt;
&lt;td&gt;Detect vulnerable dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DAST&lt;/td&gt;
&lt;td&gt;Test running applications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAST&lt;/td&gt;
&lt;td&gt;Runtime security analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Penetration Testing&lt;/td&gt;
&lt;td&gt;Simulate attacker behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Penetration testing combines elements of all these approaches.&lt;/p&gt;

&lt;p&gt;A mature CI/CD pipeline continuously performs automated penetration testing while reserving manual testing for sophisticated attack scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing a Security-First CI/CD Architecture
&lt;/h2&gt;

&lt;p&gt;A security-centric pipeline typically looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer Commit
      ↓
Pre-Commit Security Checks
      ↓
Pull Request Validation
      ↓
Build Stage
      ↓
Container Security Scan
      ↓
Infrastructure Validation
      ↓
Deploy to Staging
      ↓
Automated Penetration Testing
      ↓
Security Gate
      ↓
Production Deployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage eliminates vulnerabilities before they become more expensive to fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Pre-Commit Security Controls
&lt;/h2&gt;

&lt;p&gt;The cheapest vulnerability is the one that never reaches Git.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret Detection
&lt;/h3&gt;

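&lt;p&gt;Run TruffleHog or Gitleaks as a pre-commit hook so secrets are caught before code reaches the repository. For Gitleaks, add this to &lt;code&gt;.pre-commit-config.yaml&lt;/code&gt;:&lt;br&gt;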
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/gitleaks/gitleaks&lt;/span&gt;
  &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v8.20.0&lt;/span&gt;
  &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gitleaks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Developer installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pre-commit

pre-commit &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every commit is automatically scanned.&lt;/p&gt;
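
&lt;p&gt;To make concrete what such a hook looks for, here is a deliberately tiny sketch of pattern-based secret detection. It assumes a single rule for long-term AWS access key IDs (20 characters starting with &lt;code&gt;AKIA&lt;/code&gt;); real scanners such as Gitleaks ship far larger rule sets, and &lt;code&gt;find_secrets&lt;/code&gt; is a hypothetical name used only for illustration.&lt;/p&gt;

```python
import re

# Assumption: one illustrative rule. Long-term AWS access key IDs are
# 20 characters and start with "AKIA"; real tools use many more patterns.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_secrets(text):
    """Return (line_number, match) pairs for every suspected key in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.findall(line):
            hits.append((number, match))
    return hits
```

&lt;p&gt;A pre-commit hook applies the same idea to the staged diff and rejects the commit when the list of hits is not empty.&lt;/p&gt;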

&lt;h3&gt;
  
  
  Dependency Security Validation
&lt;/h3&gt;

&lt;p&gt;Use dependency auditing tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pip-audit

pip-audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Node.js&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm audit &lt;span class="nt"&gt;--omit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Go&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;govulncheck ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 2: Automated Penetration Testing in Pull Requests
&lt;/h2&gt;

&lt;p&gt;Pull requests provide the earliest opportunity to validate attack surfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication Testing
&lt;/h3&gt;

&lt;p&gt;Example automated API validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://staging.example.com/api/admin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple test catches accidental authentication bypasses, such as an admin endpoint deployed without an auth check.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authorization Validation
&lt;/h3&gt;

&lt;p&gt;Testing role separation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
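&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# Assumed for illustration: the endpoint under test, plus get_admin_token() and
&lt;/span&gt;&lt;span class="c1"&gt;# get_user_token() helpers that return a valid token for each role.
&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://staging.example.com/api/admin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;admin_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_admin_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;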
&lt;span class="n"&gt;user_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_user_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;admin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;admin_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tests often discover privilege escalation vulnerabilities before release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Dynamic Application Security Testing (DAST)
&lt;/h2&gt;

&lt;p&gt;DAST probes a running application from the outside, the way an attacker would, without access to the source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  OWASP ZAP Automation
&lt;/h3&gt;

&lt;p&gt;Deploy the application into a temporary staging environment.&lt;/p&gt;

&lt;p&gt;GitHub Actions example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OWASP ZAP Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zaproxy/action-full-scan@v0.10.0&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://staging.example.com"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Baseline Scan
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-t&lt;/span&gt; ghcr.io/zaproxy/zaproxy:stable zap-baseline.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; https://staging.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Findings may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing security headers&lt;/li&gt;
&lt;li&gt;XSS exposure&lt;/li&gt;
&lt;li&gt;Cookie weaknesses&lt;/li&gt;
&lt;li&gt;Directory traversal&lt;/li&gt;
&lt;li&gt;Information disclosure&lt;/li&gt;
&lt;/ul&gt;
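
&lt;p&gt;The first item on that list is simple enough to illustrate by hand. The sketch below is not how ZAP implements its rules; it assumes an illustrative subset of headers and a hypothetical helper name, &lt;code&gt;missing_security_headers&lt;/code&gt;, and works on any mapping of response header names to values.&lt;/p&gt;

```python
# Assumption: an illustrative subset of headers; tune the list to your policy.
EXPECTED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
)


def missing_security_headers(headers):
    """Return the expected headers absent from a response, ignoring case."""
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_HEADERS if name.lower() not in present]
```

&lt;p&gt;In a pipeline, the mapping could come from the headers of a response fetched from the staging URL, and a non-empty result would be reported as a finding.&lt;/p&gt;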

&lt;h2&gt;
  
  
  Stage 4: API Penetration Testing Automation
&lt;/h2&gt;

&lt;p&gt;For many modern applications, APIs are now the largest attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAPI Security Testing
&lt;/h3&gt;

&lt;p&gt;Tools like Schemathesis can automatically generate attack cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;schemathesis run openapi.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://staging.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generated tests include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invalid inputs&lt;/li&gt;
&lt;li&gt;Boundary conditions&lt;/li&gt;
&lt;li&gt;Injection attempts&lt;/li&gt;
&lt;li&gt;Authentication failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fuzz Testing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;schemathesis run &lt;span class="se"&gt;\&lt;/span&gt;
  openapi.yaml &lt;span class="se"&gt;\&lt;/span&gt;
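  &lt;span class="nt"&gt;--base-url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://staging.example.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checks&lt;/span&gt; all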
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This uncovers hidden edge cases frequently missed by developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 5: Container and Kubernetes Penetration Testing
&lt;/h2&gt;

&lt;p&gt;Containers introduce a different attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Container Image Scanning
&lt;/h3&gt;

&lt;p&gt;Trivy example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;trivy image myapp:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;
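    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIGH,CRITICAL&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;  &lt;span class="c1"&gt;# fail the job when matching findings exist&lt;/span&gt;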
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kubernetes Security Assessment
&lt;/h3&gt;

&lt;p&gt;Using Kubescape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubescape scan framework nsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checks include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Privileged containers&lt;/li&gt;
&lt;li&gt;Host networking&lt;/li&gt;
&lt;li&gt;Excessive capabilities&lt;/li&gt;
&lt;li&gt;Missing security contexts&lt;/li&gt;
&lt;/ul&gt;
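
&lt;p&gt;To show what these checks amount to, here is a simplified sketch that audits a pod spec supplied as a plain dict. It is an assumption-laden illustration, not Kubescape's implementation: &lt;code&gt;audit_pod_spec&lt;/code&gt; is a hypothetical name, and only the four checks listed above are covered.&lt;/p&gt;

```python
def audit_pod_spec(pod_spec):
    """Return human-readable findings for a pod spec given as a plain dict."""
    findings = []
    if pod_spec.get("hostNetwork"):
        findings.append("pod uses host networking")
    for container in pod_spec.get("containers", []):
        name = container.get("name", "unnamed")
        context = container.get("securityContext")
        if context is None:
            findings.append(name + ": missing securityContext")
            continue
        if context.get("privileged"):
            findings.append(name + ": privileged container")
        added = context.get("capabilities", {}).get("add", [])
        if added:
            findings.append(name + ": adds capabilities " + ", ".join(added))
    return findings
```

&lt;p&gt;The dict passed in would typically be the &lt;code&gt;spec.template.spec&lt;/code&gt; section of a Deployment manifest after parsing.&lt;/p&gt;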

&lt;h2&gt;
  
  
  Stage 6: Infrastructure Penetration Testing
&lt;/h2&gt;

&lt;p&gt;Infrastructure is code. It should be tested like code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraform Security Validation
&lt;/h3&gt;

&lt;p&gt;Checkov example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;checkov &lt;span class="nt"&gt;-d&lt;/span&gt; terraform/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Terraform Security Scan&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cloud Configuration Testing
&lt;/h3&gt;

&lt;p&gt;AWS example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prowler aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Findings include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public S3 buckets&lt;/li&gt;
&lt;li&gt;Weak IAM policies&lt;/li&gt;
&lt;li&gt;Missing encryption&lt;/li&gt;
&lt;li&gt;Excessive permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Integrating Security Gates into CI/CD
&lt;/h2&gt;

&lt;p&gt;Not every vulnerability should block deployment.&lt;/p&gt;

&lt;p&gt;A practical policy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;Fail build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Fail production deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Ticket creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Backlog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

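&lt;p&gt;GitHub Actions gate, assuming an earlier step has exported the number of critical findings as &lt;code&gt;CRITICAL_FINDINGS&lt;/code&gt;:&lt;br&gt;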
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;- name: Fail on Critical Issues
  run: |
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CRITICAL_FINDINGS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;1
    &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
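
&lt;p&gt;The full severity policy from the table above can also be expressed as a small decision function. This is a sketch under stated assumptions: the action names and the helpers &lt;code&gt;gate_actions&lt;/code&gt; and &lt;code&gt;build_should_fail&lt;/code&gt; are hypothetical, and the input is assumed to be a dict of finding counts per severity produced by an earlier aggregation step.&lt;/p&gt;

```python
# Severity-to-action mapping taken from the policy table above.
POLICY = {
    "critical": "fail_build",
    "high": "fail_production_deployment",
    "medium": "create_ticket",
    "low": "backlog",
}


def gate_actions(counts):
    """Return the actions triggered by a dict of finding counts per severity."""
    return [
        action
        for severity, action in POLICY.items()
        if counts.get(severity, 0) != 0
    ]


def build_should_fail(counts):
    """Only critical findings stop the build itself."""
    return "fail_build" in gate_actions(counts)
```

&lt;p&gt;Keeping the mapping in one place makes the policy reviewable: changing how a severity is handled becomes a one-line diff instead of an edit scattered across workflow steps.&lt;/p&gt;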



&lt;h2&gt;
  
  
  Reporting, Dashboards, and Vulnerability Management
&lt;/h2&gt;

&lt;p&gt;Security data scattered across tools creates chaos.&lt;/p&gt;

&lt;p&gt;Centralize findings.&lt;/p&gt;

&lt;p&gt;Popular platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DefectDojo&lt;/li&gt;
&lt;li&gt;AWS Security Hub&lt;/li&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;ELK Stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example DefectDojo import:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Token &lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@zap-report.xml"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  https://defectdojo.example.com/api/v2/import-scan/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building a Production-Ready DevSecOps Pipeline
&lt;/h2&gt;

&lt;p&gt;Complete GitHub Actions workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Security Testing&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

  &lt;span class="na"&gt;sast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Semgrep&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;returntocorp/semgrep-action@v1&lt;/span&gt;

  &lt;span class="na"&gt;dependency-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy FS&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trivy fs .&lt;/span&gt;

  &lt;span class="na"&gt;container-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build Image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t app:${{ github.sha }} .&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scan Image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trivy image app:${{ github.sha }}&lt;/span&gt;

  &lt;span class="na"&gt;iac-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkov&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkov -d terraform/&lt;/span&gt;

  &lt;span class="na"&gt;dast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sast&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;dependency-scan&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container-scan&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;iac-scan&lt;/span&gt;

    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OWASP ZAP&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zaproxy/action-full-scan@v0.10.0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://staging.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs automated security testing from commit through to staging. It complements manual penetration testing rather than replacing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes and Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Treating DAST as Penetration Testing
&lt;/h3&gt;

&lt;p&gt;DAST is valuable but incomplete. Human attackers exploit business logic flaws that scanners cannot detect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Every Scan on Every Commit
&lt;/h3&gt;

&lt;p&gt;Excessive scanning creates developer fatigue.&lt;/p&gt;

&lt;p&gt;Use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast scans on pull requests&lt;/li&gt;
&lt;li&gt;Full DAST scans on staging&lt;/li&gt;
&lt;li&gt;Deep assessments before production&lt;/li&gt;
&lt;/ul&gt;
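
&lt;p&gt;One way to express that split in GitHub Actions, as a sketch rather than a finished workflow: keep the fast job on every trigger and gate the slow scan on the event type. The &lt;code&gt;main&lt;/code&gt; branch name and the weekly cron time are assumptions; the action references are the ones used in the workflow earlier in this post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: Tiered Security Testing

on:
  pull_request:            # fast feedback only
  push:
    branches: [main]       # assumed default branch name
  schedule:
    - cron: "0 3 * * 1"    # weekly deep scan; the time is arbitrary

jobs:
  sast:
    # cheap enough to run on every trigger
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Semgrep
        uses: returntocorp/semgrep-action@v1

  dast:
    # skip the slow full scan on pull requests
    if: github.event_name != 'pull_request'
    needs: [sast]
    runs-on: ubuntu-latest
    steps:
      - name: OWASP ZAP
        uses: zaproxy/action-full-scan@v0.10.0
        with:
          target: https://staging.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;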

&lt;h3&gt;
  
  
  Ignoring Authentication and Authorization Testing
&lt;/h3&gt;

&lt;p&gt;Many breaches result from broken authentication or authorization rather than from the kind of flaw a vulnerability scanner flags.&lt;/p&gt;

&lt;p&gt;Focus heavily on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RBAC validation&lt;/li&gt;
&lt;li&gt;Token abuse testing&lt;/li&gt;
&lt;li&gt;API authorization checks&lt;/li&gt;
&lt;/ul&gt;
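
&lt;p&gt;An API authorization check can be as small as one pipeline step. The sketch below calls an admin-only endpoint with a low-privilege token and fails the job unless the API answers 403. The &lt;code&gt;VIEWER_TOKEN&lt;/code&gt; secret and the &lt;code&gt;/api/admin/users&lt;/code&gt; path are hypothetical placeholders, not part of any real application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Low-privilege token must not reach the admin API
  env:
    VIEWER_TOKEN: ${{ secrets.VIEWER_TOKEN }}   # hypothetical read-only credential
  run: |
    # capture only the HTTP status code; the response body is discarded
    status=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $VIEWER_TOKEN" \
      https://staging.example.com/api/admin/users)
    if [ "$status" != "403" ]; then
      echo "expected 403 for a viewer token, got $status"
      exit 1
    fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;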

&lt;h3&gt;
  
  
  Failing to Prioritize Findings
&lt;/h3&gt;

&lt;p&gt;Thousands of low-risk findings provide little value.&lt;/p&gt;

&lt;p&gt;Security teams should prioritize:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Critical&lt;/li&gt;
&lt;li&gt;High&lt;/li&gt;
&lt;li&gt;Exploitable&lt;/li&gt;
&lt;li&gt;Internet-facing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else comes later.&lt;/p&gt;

&lt;p&gt;Integrating penetration testing into CI/CD transforms security from a periodic activity into a continuous engineering practice. By combining pre-commit validation, SAST, dependency scanning, container assessment, infrastructure testing, API security analysis, and automated DAST, organizations can identify vulnerabilities at the earliest possible stage.&lt;/p&gt;

&lt;p&gt;The strongest DevSecOps programs do not rely on a single security tool. They build layered defenses throughout the entire software delivery lifecycle, ensuring that every commit, build, deployment, and infrastructure change is evaluated through an attacker's lens before it reaches production.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cicd</category>
      <category>webdev</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Network Policies with Calico: Default Deny and Namespace Isolation</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Mon, 15 Jun 2026 12:38:04 +0000</pubDate>
      <link>https://dev.to/futhgar/network-policies-with-calico-default-deny-and-namespace-isolation-1p63</link>
      <guid>https://dev.to/futhgar/network-policies-with-calico-default-deny-and-namespace-isolation-1p63</guid>
      <description>&lt;p&gt;A default-deny NetworkPolicy is five lines of spec. Those five lines will also kill DNS resolution for every pod they select, because an egress deny blocks UDP packets to kube-dns just as happily as it blocks the traffic you were actually worried about. The distance between "I understand network policies" and "I rolled out default deny without an outage" is mostly three blind spots: DNS, your ingress controller, and admission webhooks.&lt;/p&gt;

&lt;p&gt;Out of the box, Kubernetes runs a flat pod network. Every pod can open a connection to every other pod in the cluster, across namespaces, no questions asked. If you've already done the work of &lt;a href="https://guatulabs.dev/posts/kubernetes-rbac-building-least-privilege-service-accounts/" rel="noopener noreferrer"&gt;building least-privilege service accounts&lt;/a&gt;, a flat network is the same problem one layer down: identity is locked tight while the network is wide open. This post is about closing that gap with Calico on a bare-metal cluster (K8s 1.31, Calico 3.x), in an order that doesn't take the cluster down while you do it.&lt;/p&gt;

&lt;p&gt;One prerequisite worth stating plainly: the NetworkPolicy API objects exist in every cluster, but they do nothing unless your CNI enforces them. Calico does. If you're on a CNI without policy support, you can apply these manifests all day and traffic flows anyway, which is its own special category of false confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rollout that looks right and isn't
&lt;/h2&gt;

&lt;p&gt;The tempting approach goes like this: write one default-deny policy, template it across every namespace, apply, done. Security checkbox ticked before lunch.&lt;/p&gt;

&lt;p&gt;Here's the policy everyone starts with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-deny&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;        &lt;span class="c1"&gt;# selects every pod in the namespace&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Egress&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The empty &lt;code&gt;podSelector&lt;/code&gt; selects all pods, and listing both policy types makes them isolated in both directions. Correct, minimal, and the moment it lands cluster-wide, three things break in a predictable order.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure one: DNS dies first, and it dies slowly
&lt;/h3&gt;

&lt;p&gt;Every pod in a selected namespace loses the ability to resolve names, because queries to kube-dns in &lt;code&gt;kube-system&lt;/code&gt; are egress traffic like any other. The nasty part is the failure mode. A connection to a denied endpoint hangs and then fails with a timeout you'll notice. DNS failures look different: each lookup waits out a 5-second timeout per attempt, multiplied by the search domain list your &lt;code&gt;ndots&lt;/code&gt; config generates. Apps get slow before they get broken, which sends you debugging application performance instead of network policy. I wrote about how the search domain expansion amplifies this in &lt;a href="https://guatulabs.dev/posts/wildcard-dns-ndots-5-the-tls-nightmare-and-how-to-fix-it/" rel="noopener noreferrer"&gt;the ndots:5 post&lt;/a&gt;; default deny turns every one of those expanded lookups into a 5-second black hole.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure two: your ingress controller can't reach anything
&lt;/h3&gt;

&lt;p&gt;Traffic from Traefik or ingress-nginx to your backend pods is just pod-to-pod traffic crossing a namespace boundary. Default deny on the application namespace blocks it, and every service behind the ingress starts returning 502s and 504s. The application pods are healthy, the Service endpoints are populated, readiness probes pass (kubelet probes come from the node, and Calico permits them). Everything looks green except the part where users reach it. This also bites cert-manager: an HTTP-01 challenge needs the ingress controller to reach the temporary solver pod, so default deny can silently stall certificate issuance long after the initial rollout.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure three: the webhook deadlock
&lt;/h3&gt;

&lt;p&gt;This is the one that turns a degraded cluster into a stuck one. Admission webhooks (Kyverno, cert-manager's webhook, anything with a &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt;) receive calls from the API server. Deny ingress to the webhook pod and those calls time out. With &lt;code&gt;failurePolicy: Fail&lt;/code&gt;, the API server now rejects the operations that webhook gates, and the trap closes: the NetworkPolicy you're trying to apply to fix the problem is itself an API operation that flows through admission. You're locked out of the fix by the thing you broke.&lt;/p&gt;

&lt;p&gt;It gets worse if the policies are managed by automation. With a Kyverno generate rule or &lt;a href="https://guatulabs.dev/posts/gitops-for-homelabs-argocd-app-of-apps/" rel="noopener noreferrer"&gt;a GitOps controller&lt;/a&gt; syncing the policy, deleting the offending NetworkPolicy by hand buys you a few seconds before it's regenerated. You end up playing whack-a-mole against your own reconciliation loop while the cluster burns. The escape hatch is to pause the automation first (scale down Kyverno, disable ArgoCD auto-sync for that app), then remove the policy.&lt;/p&gt;

&lt;p&gt;A detail that matters here: API server traffic to webhooks often originates from the control plane host network, not from a pod you can match with a &lt;code&gt;podSelector&lt;/code&gt;. Allowing it means an &lt;code&gt;ipBlock&lt;/code&gt; rule for your control plane CIDR, or excluding webhook namespaces from default deny entirely. I do the latter.&lt;/p&gt;
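
&lt;p&gt;For the first option, a minimal sketch of the &lt;code&gt;ipBlock&lt;/code&gt; allow might look like the policy below. It only matters if the webhook namespace is itself under default deny. The &lt;code&gt;kyverno&lt;/code&gt; namespace, the &lt;code&gt;10.0.0.0/24&lt;/code&gt; CIDR, and port 9443 are assumptions to replace with your own values; depending on the encapsulation mode, the source address the pod sees may be a node's tunnel address rather than its host address, so check real traffic before relying on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apiserver-to-webhook
  namespace: kyverno              # assumed webhook namespace
spec:
  podSelector: {}                 # narrow this to the webhook pods if you can
  policyTypes: [Ingress]
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/24     # assumed control plane node CIDR
      ports:
        - protocol: TCP
          port: 9443              # assumed webhook container port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;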

&lt;h2&gt;
  
  
  A rollout order that works
&lt;/h2&gt;

&lt;p&gt;The fix for all three failures is the same discipline: never apply a deny you haven't already written the allows for, and never apply it wider than you can watch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: one namespace, not the cluster
&lt;/h3&gt;

&lt;p&gt;Pick a single application namespace with low blast radius. Resist the urge to start cluster-wide; the whole point of the first namespace is to discover the flows you forgot existed. &lt;code&gt;kubectl get networkpolicy -A&lt;/code&gt; should stay boring while you learn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: the baseline trio
&lt;/h3&gt;

&lt;p&gt;Default deny ships as a set of three policies applied together, in one &lt;code&gt;kubectl apply -f&lt;/code&gt; of one directory. The deny:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-deny&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Ingress&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Egress&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DNS allow, which goes everywhere the deny goes, no exceptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allow-dns-egress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Egress&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kubernetes.io/metadata.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
          &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dns&lt;/span&gt;
      &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;53&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;53&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both protocols matter. DNS falls back to TCP for large responses, and an egress rule that only allows UDP produces intermittent failures that are miserable to track down.&lt;/p&gt;

&lt;p&gt;The intra-namespace and ingress-controller allow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allow-baseline-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Ingress&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# any pod in this same namespace&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
    &lt;span class="c1"&gt;# everything in the ingress controller's namespace&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kubernetes.io/metadata.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;kubernetes.io/metadata.name&lt;/code&gt; label is the load-bearing trick here. Since K8s 1.22, every namespace carries it automatically with its own name as the value, which gives you a stable way to select namespaces without inventing and maintaining your own labeling scheme.&lt;/p&gt;

&lt;p&gt;With the trio applied, check behavior from inside the namespace before moving on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# throwaway pod inside the locked-down namespace&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; team-a run probe &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:1.36 &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Never &lt;span class="nt"&gt;--&lt;/span&gt; sh
&lt;span class="c"&gt;# inside the pod:&lt;/span&gt;
nslookup kubernetes.default                                  &lt;span class="c"&gt;# should answer instantly&lt;/span&gt;
wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; 2 http://api.team-b.svc.cluster.local           &lt;span class="c"&gt;# should time out&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fast DNS plus a cross-namespace connection that hangs until it times out is the signature of a healthy baseline. A DNS lookup that hangs and times out means the &lt;code&gt;allow-dns-egress&lt;/code&gt; policy didn't land; an instant cross-namespace success means the deny didn't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: log before you deny
&lt;/h3&gt;

&lt;p&gt;Calico's &lt;code&gt;Log&lt;/code&gt; rule action is the visibility tool the vanilla NetworkPolicy API doesn't have. Before tightening further, I put a logging policy behind the allows so I can see what the deny is about to catch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projectcalico.org/v3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GlobalNetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log-unmatched&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4000&lt;/span&gt;                    &lt;span class="c1"&gt;# evaluated after everything else&lt;/span&gt;
  &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projectcalico.org/name == 'team-a'&lt;/span&gt;
  &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Ingress&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Egress&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log&lt;/span&gt;
  &lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the iptables dataplane, &lt;code&gt;Log&lt;/code&gt; uses the kernel LOG target, so dropped-candidate packets show up in the kernel log with a &lt;code&gt;calico-packet:&lt;/code&gt; prefix (configurable via &lt;code&gt;logPrefix&lt;/code&gt; in FelixConfiguration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;--grep&lt;/span&gt; calico-packet


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two caveats. Kernel logging is noisy, so treat this as a diagnostic you enable for hours, not a permanent fixture. And the eBPF dataplane doesn't support the &lt;code&gt;Log&lt;/code&gt; action, so if you've switched dataplanes this tool isn't available.&lt;/p&gt;

&lt;p&gt;This step is where "set and forget" turns into something closer to auditing. Run a logging policy for a day against a namespace before enforcing, and you find the flows nobody documented: the metrics scraper, the backup job, the sidecar that phones a service in another namespace.&lt;/p&gt;

&lt;p&gt;One class of flow deserves special mention: anything running with &lt;code&gt;hostNetwork: true&lt;/code&gt;. Node-level monitoring agents and some bare-metal ingress deployments source their traffic from the node's IP, not a pod IP, so &lt;code&gt;podSelector&lt;/code&gt; and &lt;code&gt;namespaceSelector&lt;/code&gt; rules never match them. If scraping or health checks break only after enforcement, this is usually why, and the fix is an &lt;code&gt;ipBlock&lt;/code&gt; rule covering your node CIDR rather than another selector you'll fight with.&lt;/p&gt;
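
&lt;p&gt;A sketch of that &lt;code&gt;ipBlock&lt;/code&gt; rule, assuming the nodes sit in &lt;code&gt;192.168.10.0/24&lt;/code&gt; and the pods expose metrics on TCP 8080. Both values and the policy name are placeholders, not taken from a real cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-node-scrape
  namespace: team-a
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    # hostNetwork agents arrive with a node IP, so match on address, not labels
    - from:
        - ipBlock:
            cidr: 192.168.10.0/24   # assumed node CIDR
      ports:
        - protocol: TCP
          port: 8080                # assumed metrics port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Restricting the rule to a single port keeps the node-wide allow from becoming a general bypass of the namespace's other policies.&lt;/p&gt;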

&lt;h3&gt;
  
  
  Step 4: the cluster-wide backstop
&lt;/h3&gt;

&lt;p&gt;Once the per-namespace pattern is proven, Calico's &lt;code&gt;GlobalNetworkPolicy&lt;/code&gt; enforces namespace isolation as a guardrail across every tenant namespace at once, with infrastructure explicitly carved out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projectcalico.org/v3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GlobalNetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant-isolation-backstop&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
    &lt;span class="s"&gt;projectcalico.org/name not in&lt;/span&gt;
    &lt;span class="s"&gt;{"kube-system", "calico-system", "calico-apiserver",&lt;/span&gt;
     &lt;span class="s"&gt;"ingress", "argocd", "cert-manager", "kyverno"}&lt;/span&gt;
  &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Ingress&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Egress&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# DNS keeps working even where namespace policies are missing&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s-app == 'kube-dns'&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;53&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s-app == 'kube-dns'&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;53&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No explicit &lt;code&gt;Deny&lt;/code&gt; rule, and that's deliberate. In Calico, when at least one policy selects an endpoint and no rule allows the packet, the packet is dropped at the end of evaluation. The backstop selects everything outside the exclusion list, allows DNS, and lets the implicit deny do the rest.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;order: 3000&lt;/code&gt; is doing real work. Calico assigns Kubernetes NetworkPolicies an order of 1000, and lower order means earlier evaluation. An allow in a namespace's own policy terminates evaluation before the backstop is ever consulted. The backstop only catches traffic nothing else has claimed, which means namespaces with proper policies behave per their policies, and namespaces without any get isolation by default instead of the flat network.&lt;/p&gt;

&lt;p&gt;That exclusion list is the "infrastructure exclusion" pattern, and I'd argue it's the single most important decision in the whole rollout. The namespaces that run your CNI, your ingress, your GitOps controller, and your admission webhooks are the namespaces where a policy mistake costs you the ability to fix policy mistakes. Leave them out of automated enforcement. Write their policies by hand, later, one at a time, with the logging step in between.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: automate generation, with the same exclusions
&lt;/h3&gt;

&lt;p&gt;For new namespaces, a &lt;a href="https://guatulabs.dev/posts/kyverno-admission-controllers-policy-as-code-that-actually-works/" rel="noopener noreferrer"&gt;Kyverno generate rule&lt;/a&gt; stamps the baseline trio in automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generate-default-deny&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-deny&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kube-system"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kube-public"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kube-node-lease"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
                      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calico-system"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingress"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;argocd"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cert-manager"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kyverno"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;generate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-deny&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{request.object.metadata.name}}"&lt;/span&gt;
        &lt;span class="na"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
            &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Ingress&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Egress&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two operational notes. &lt;code&gt;synchronize: true&lt;/code&gt; is what creates the regeneration loop from failure three: hand-deleting the generated policy gets it recreated within seconds, so during an incident you pause the ClusterPolicy before touching its output. And Kyverno treats generate rules as effectively immutable: if the generated resource definition is wrong, plan on deleting and recreating the ClusterPolicy rather than patching it in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works
&lt;/h2&gt;

&lt;p&gt;The mental model that makes all of this predictable: Kubernetes NetworkPolicies are additive allow-lists with an implicit deny that activates the moment any policy selects a pod. There is no deny rule in the vanilla API. A pod selected by zero policies accepts everything; a pod selected by any policy accepts only what the union of matching policies allows. That's why the baseline trio works as a set: the deny policy flips the pod into isolated mode, and the other two define the allowed surface.&lt;/p&gt;
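
&lt;p&gt;To make the additive model concrete, here is a minimal Python sketch of it. It is a toy, not how any CNI implements policy: it covers egress only, the policy names &lt;code&gt;allow-dns&lt;/code&gt; and &lt;code&gt;allow-same-namespace&lt;/code&gt; are placeholders for the two allow policies in the trio, and the peer strings are made up for illustration.&lt;/p&gt;

```python
# Toy model of vanilla Kubernetes NetworkPolicy semantics (egress only).
# Policy names and "peer" strings are hypothetical, for illustration.

def selects(policy, pod_labels):
    """An empty pod selector ({}) matches every pod in the namespace."""
    return all(pod_labels.get(k) == v for k, v in policy["pod_selector"].items())


def egress_allowed(policies, pod_labels, peer):
    selecting = [p for p in policies if selects(p, pod_labels)]
    if not selecting:
        # Selected by zero policies: the pod is not isolated, everything passes.
        return True
    # Selected by any policy: only the union of the matching allow-lists passes.
    return any(peer in p["allow"] for p in selecting)


baseline_trio = [
    {"name": "default-deny", "pod_selector": {}, "allow": set()},
    {"name": "allow-dns", "pod_selector": {}, "allow": {"kube-dns:53"}},
    {"name": "allow-same-namespace", "pod_selector": {}, "allow": {"same-namespace"}},
]

pod = {"app": "web"}
print(egress_allowed([], pod, "internet:443"))             # True: no policy selects the pod
print(egress_allowed(baseline_trio, pod, "kube-dns:53"))   # True: allow-dns permits it
print(egress_allowed(baseline_trio, pod, "internet:443"))  # False: nothing in the union allows it
```

&lt;p&gt;Note that &lt;code&gt;default-deny&lt;/code&gt; contributes an empty allow-list. Its only job is to make sure the pod is selected by something, which is what switches on the implicit deny.&lt;/p&gt;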

&lt;p&gt;Calico layers an ordered evaluation model on top. Policies are sorted by &lt;code&gt;order&lt;/code&gt;, rules within a policy run top to bottom, and the first &lt;code&gt;Allow&lt;/code&gt; or &lt;code&gt;Deny&lt;/code&gt; terminates evaluation. Kubernetes-native policies slot in at order 1000 (you can see the converted versions with &lt;code&gt;calicoctl get networkpolicy --all-namespaces&lt;/code&gt;, prefixed &lt;code&gt;knp.default.&lt;/code&gt;). Pods matched by no policy at all fall through to Calico's per-namespace profiles, which default to allow. That layering is exactly what makes the backstop-at-3000 pattern safe: specific intent at 1000 wins, the guardrail catches the remainder, and the logging policy at 4000 sees only what's about to die.&lt;/p&gt;
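
&lt;p&gt;The ordering argument can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Calico's implementation: it has a single tier, egress only, no &lt;code&gt;Pass&lt;/code&gt; action, and it matches DNS by port alone. The &lt;code&gt;shop&lt;/code&gt; namespace, the &lt;code&gt;knp.default.allow-db&lt;/code&gt; policy, port 5432, and the shortened exclusion set are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Simplified model of Calico's ordered evaluation (single tier, egress only).
# Policy names, namespaces, and ports below are hypothetical examples.


@dataclass
class Rule:
    action: str                      # "Allow", "Deny", or "Log"
    matches: Callable[[dict], bool]  # predicate over a toy packet


@dataclass
class Policy:
    name: str
    order: int
    selects: Callable[[dict], bool]  # predicate over a toy endpoint
    rules: List[Rule] = field(default_factory=list)


def evaluate(policies, endpoint, packet):
    """Return (verdict, decided_by, logged_by)."""
    logged_by = []
    selected = [p for p in policies if p.selects(endpoint)]
    if not selected:
        # No policy selects the endpoint: fall through to the profile default.
        return ("Allow", "profile-default", logged_by)
    for policy in sorted(selected, key=lambda p: p.order):  # lower order first
        for rule in policy.rules:                           # rules run top to bottom
            if not rule.matches(packet):
                continue
            if rule.action == "Log":
                logged_by.append(policy.name)  # Log records, then evaluation continues
                continue
            return (rule.action, policy.name, logged_by)    # first Allow or Deny wins
    # Selected by at least one policy, but nothing allowed the packet.
    return ("Deny", "implicit-deny", logged_by)


EXCLUDED = {"kube-system", "calico-system", "ingress", "argocd", "cert-manager", "kyverno"}

policies = [
    Policy("log-unclaimed", 4000, lambda ep: ep["ns"] not in EXCLUDED,
           [Rule("Log", lambda pkt: True)]),
    Policy("backstop", 3000, lambda ep: ep["ns"] not in EXCLUDED,
           [Rule("Allow", lambda pkt: pkt["port"] == 53)]),
    Policy("knp.default.allow-db", 1000, lambda ep: ep["ns"] == "shop",
           [Rule("Allow", lambda pkt: pkt["port"] == 5432)]),
]

shop = {"ns": "shop"}
print(evaluate(policies, shop, {"port": 5432}))  # claimed by the namespace policy at 1000
print(evaluate(policies, shop, {"port": 53}))    # DNS allowed by the backstop at 3000
print(evaluate(policies, shop, {"port": 8080}))  # logged at 4000, then implicitly denied
print(evaluate(policies, {"ns": "kube-system"}, {"port": 8080}))  # excluded: profile default
```

&lt;p&gt;The list is deliberately declared out of order to show that only &lt;code&gt;order&lt;/code&gt; matters. Only the third packet reaches the logging policy, because the first two are terminated by an &lt;code&gt;Allow&lt;/code&gt; earlier in the chain.&lt;/p&gt;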

&lt;p&gt;Felix, Calico's per-node agent, also quietly saves you from the worst self-own. Its failsafe port list (SSH on 22, the API server on 6443, BGP on 179, etcd, Typha) is exempt from policy on host endpoints by default, so a bad policy can break your workloads without also locking you out of the nodes you need to fix it from. Don't shrink that list without a very specific reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;The failure modes are knowable in advance. DNS, ingress, and webhooks fail in that order every time, and writing the allows before the deny is cheaper in every way than discovering them from a monitoring graph. If a rollout plan doesn't mention &lt;code&gt;kube-dns&lt;/code&gt;, port 53, or &lt;code&gt;failurePolicy&lt;/code&gt;, it isn't done.&lt;/p&gt;

&lt;p&gt;Namespace-by-namespace beats cluster-wide, even though it feels slower. The first namespace takes a day because you're discovering undocumented flows. The tenth takes ten minutes because there's nothing left to discover. Going cluster-wide first inverts that: you discover everything at once, in production, with automation re-applying the breakage faster than you can remove it.&lt;/p&gt;

&lt;p&gt;Exclude infrastructure from automation permanently, not temporarily. Every system that can generate or sync policies (Kyverno, ArgoCD, your own scripts) should carry the same exclusion list for &lt;code&gt;kube-system&lt;/code&gt;, the CNI namespace, ingress, GitOps, and webhook namespaces. The asymmetry is stark: a missing policy in those namespaces costs you some security posture, while a wrong policy there costs you the control plane's ability to accept the fix.&lt;/p&gt;
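
&lt;p&gt;Keeping several copies of one list in sync is the kind of thing that drifts quietly, so it may be worth a small check in CI. The sketch below is one possible shape for it, assuming you can extract each tool's exclusion list into a plain set. The source names and the drifted entries are invented for illustration.&lt;/p&gt;

```python
# Hypothetical exclusion lists, as they might be copied into three tools.
REFERENCE = {"kube-system", "kube-public", "kube-node-lease",
             "calico-system", "ingress", "argocd", "cert-manager", "kyverno"}

exclusions = {
    "calico-backstop": set(REFERENCE),
    "kyverno-generate": REFERENCE - {"cert-manager"},  # drifted copy, one entry lost
    "argocd-sync": REFERENCE | {"monitoring"},         # extra entry added by hand
}


def drift(reference, lists):
    """Map each source to the namespaces it is missing or has added."""
    report = {}
    for source, names in lists.items():
        missing = sorted(reference - names)
        extra = sorted(names - reference)
        if missing or extra:
            report[source] = {"missing": missing, "extra": extra}
    return report


for source, diff in drift(REFERENCE, exclusions).items():
    print(source, diff)
```

&lt;p&gt;A missing entry is the dangerous direction, since it means automation will start enforcing policy in an infrastructure namespace. An extra entry only costs some coverage.&lt;/p&gt;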

&lt;p&gt;Logging is the difference between policy as guesswork and policy as engineering. The &lt;code&gt;Log&lt;/code&gt; action is crude (kernel log lines, iptables dataplane only), but it converts "why is this connection failing" from a hypothesis into a grep. I'd take crude visibility over elegant blindness in any network debugging session. This pattern, restrict by default and watch the boundary, is the same shape as the guardrails I build around &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;autonomous agent infrastructure&lt;/a&gt;: the deny is easy, and the engineering is in the observability that tells you what the deny will cost before you pay it.&lt;/p&gt;

&lt;p&gt;The thing the docs undersell is that default deny is a migration, not a manifest. The YAML is trivial. The work is the inventory of flows your cluster actually depends on, and you only get that inventory by watching one namespace at a time with the logs on.&lt;/p&gt;

</description>
      <category>calico</category>
      <category>networkpolicies</category>
      <category>kubernetes</category>
      <category>security</category>
    </item>
    <item>
      <title>CKS Success Story -44 Days Walking with AI-</title>
      <dc:creator>Aoi Takahashi</dc:creator>
      <pubDate>Mon, 15 Jun 2026 11:54:45 +0000</pubDate>
      <link>https://dev.to/aoi/cks-success-story-44-days-walking-with-ai--3j74</link>
      <guid>https://dev.to/aoi/cks-success-story-44-days-walking-with-ai--3j74</guid>
      <description>&lt;h2&gt;
  
  
  Hi! I'm Aoi! I Passed the CKS Exam — Here's My Story!
&lt;/h2&gt;

&lt;p&gt;I passed the CKS exam and I'm so happy I had to write about it!&lt;/p&gt;

&lt;p&gt;This time I asked AI to accompany me throughout my study journey, so I've added the subtitle &lt;strong&gt;"44 Days Walking with AI."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article is translated from the original article written in Japanese.&lt;br&gt;
&lt;a href="https://zenn.dev/aoi/articles/5a816271cbb174" rel="noopener noreferrer"&gt;https://zenn.dev/aoi/articles/5a816271cbb174&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  My Skill Set
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;7 years working with Kubernetes as an SRE&lt;/li&gt;
&lt;li&gt;Experience with Linux server setup and administration&lt;/li&gt;
&lt;li&gt;CKA and CKAD certified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SRE stands for Site Reliability Engineer — I work on platform maintenance (including Kubernetes) for developers. So I have a solid foundation in Kubernetes basics. Since we use AWS EKS, I rarely touch the Control Plane directly at work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Since I had AI accompanying me, I asked it to write the summary. Here's what it came up with:&lt;/p&gt;




&lt;h3&gt;
  
  
  Study Period
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;April 21, 2026 (Tue) – June 4, 2026 (Thu) = approximately 44 days&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I failed the first attempt (5/26) with 50 points, but regrouped quickly and passed on the second attempt (6/4).&lt;/p&gt;

&lt;h3&gt;
  
  
  Study Resources Used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KodeKloud CKS Course&lt;/strong&gt; (video + hands-on labs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Killercoda&lt;/strong&gt; (&lt;code&gt;killercoda.com/killer-shell-cks&lt;/code&gt;) — all 34 scenarios completed twice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Killer.sh&lt;/strong&gt; (practice exam included with the certification purchase, taken twice)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Killer Shell official YouTube&lt;/strong&gt; (full course, approximately 12 hours)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Official curriculum PDF&lt;/strong&gt; (&lt;code&gt;CKS_Curriculum_v1.34.pdf&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Study Flow
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Phase 1: Input Phase (4/21–5/6)
&lt;/h4&gt;

&lt;p&gt;Watched all KodeKloud video sections to build conceptual understanding of each domain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System Hardening (AppArmor / Seccomp / kube-bench)&lt;/li&gt;
&lt;li&gt;Cluster Setup (RBAC / NetworkPolicy / Ingress TLS / CSR)&lt;/li&gt;
&lt;li&gt;Microservice Vulnerabilities (PSA / gVisor / Cilium / Istio mTLS)&lt;/li&gt;
&lt;li&gt;Supply Chain Security (Trivy / SBOM / image signing / static analysis)&lt;/li&gt;
&lt;li&gt;Monitoring &amp;amp; Runtime Security (Falco / strace / Audit Logs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Golden Week (Japanese national holidays) let me get significantly ahead of schedule.&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 2: Hands-on Phase (4/27–5/13)
&lt;/h4&gt;

&lt;p&gt;Worked through Killercoda scenarios in order. I also reviewed the official curriculum PDF myself and confirmed things like OPA Gatekeeper being out of scope — staying precise about what's actually on the exam was a priority.&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 3: Practice Exam Phase (5/15–5/20)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Killer.sh attempt 1: 30-something points (ran out of time)&lt;/li&gt;
&lt;li&gt;Killer.sh attempt 2: 17 points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scores were low, but Killer.sh is intentionally harder than the real exam — the goal here was to identify weak spots.&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 4: First Exam Attempt (5/26)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Result: 50 points — failed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Domains flagged as weak:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supply Chain Security&lt;/li&gt;
&lt;li&gt;System Hardening&lt;/li&gt;
&lt;li&gt;Cluster Setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Phase 5: Re-exam Prep (5/28–6/3)
&lt;/h4&gt;

&lt;p&gt;Thoroughly analyzed where I got stuck in the first attempt and ran through all Killercoda scenarios again.&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 6: Second Attempt (6/4)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Passed!&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Study Hours
&lt;/h2&gt;

&lt;p&gt;Life with a dog keeps me busy, so I managed about 0–1 hours on weekdays and 1–3 hours on weekends. Since Golden Week fell during this period, I probably clocked around 10 hours over those few days.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You must already hold the CKA to sit the CKS exam.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Did First
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Learning from Those Who Came Before
&lt;/h3&gt;

&lt;p&gt;I started by reading other people's exam write-ups to benefit from their experience.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://qiita.com/asami-okina/items/c0b1a1ebd5d43ac56e0c" rel="noopener noreferrer"&gt;https://qiita.com/asami-okina/items/c0b1a1ebd5d43ac56e0c&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qiita.com/takahiro_fukushima/items/2479bae32c35dd93a847" rel="noopener noreferrer"&gt;https://qiita.com/takahiro_fukushima/items/2479bae32c35dd93a847&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Asami's article in particular was extremely clear — I came back to it many times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Asking AI to Accompany Me
&lt;/h3&gt;

&lt;p&gt;After reading those articles and forming a rough picture of my study path, I asked AI: &lt;em&gt;"I want to pass CKS in one month — please be my study companion."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It then put together a study plan, and I followed it throughout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Study Methods
&lt;/h2&gt;

&lt;p&gt;I worked through the following in order.&lt;/p&gt;

&lt;h3&gt;
  
  
  KodeKloud CKS Course
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://learn.kodekloud.com/courses/certified-kubernetes-security-specialist-cks" rel="noopener noreferrer"&gt;https://learn.kodekloud.com/courses/certified-kubernetes-security-specialist-cks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Requires a paid KodeKloud plan, but you can watch topic videos and do hands-on practice side by side. The standard plan runs $35/month (roughly ¥6,000), which is enough to make you feel like you absolutely have to pass within the month. (Looking just now, the standard plan is apparently $29 — a price cut!?)&lt;/p&gt;

&lt;p&gt;Some content is a bit dated, so double-check the &lt;a href="https://training.linuxfoundation.org/ja/certification/certified-kubernetes-security-specialist/" rel="noopener noreferrer"&gt;official exam requirements&lt;/a&gt; before your attempt.&lt;/p&gt;

&lt;p&gt;I got through about 80% of KodeKloud before running low on time, so I switched to the Killercoda-focused approach described below. (I also consulted AI on when to make that switch.)&lt;/p&gt;

&lt;p&gt;I didn't pass within the one-month window, but by then I was confident that cycling through Killercoda would be enough, so I cancelled the subscription. I was pleasantly surprised that KodeKloud sends a friendly "your month is almost up!" reminder.&lt;/p&gt;

&lt;h3&gt;
  
  
  Killercoda
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://killercoda.com/killer-shell-cks" rel="noopener noreferrer"&gt;https://killercoda.com/killer-shell-cks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A free hands-on environment. After getting through the KodeKloud input phase, I moved here for hands-on practice. KodeKloud's labs walk you step by step toward the answer, while Killercoda leaves you to work it out yourself, which makes it closer to the actual exam format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Killer.sh
&lt;/h3&gt;

&lt;p&gt;The official practice exam bundled with the CKS purchase. I didn't use the practice exam when I took the CKA and regretted it, so this time I planned my schedule to include proper review time before sitting it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick AI Quizzes in Spare Moments
&lt;/h3&gt;

&lt;p&gt;Video watching and hands-on practice are heavy study methods. For tired moments or short waits during commutes, I'd ask AI to "give me a quiz" so I could keep studying in small windows.&lt;/p&gt;

&lt;p&gt;One funny thing: the first answer choice was always the correct one 😂 (I caught on eventually and asked it to randomize the order.)&lt;/p&gt;

&lt;p&gt;Sometimes I'd request multiple-choice questions; other times I'd ask for free-response prompts to check my command recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Studying After Failing the First Time
&lt;/h3&gt;

&lt;p&gt;I only got 50 points on my first attempt — devastating! The result breakdown tells you which domains you struggled with, so I focused on those weak areas and just hammered Killercoda.&lt;/p&gt;

&lt;p&gt;It paid off: &lt;strong&gt;I passed the second attempt with 80 points!!!!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;It was rough! There were plenty of days I couldn't carve out study time, and the stress was through the roof. So relieved to be free of it now!&lt;/p&gt;

&lt;p&gt;I learned a lot — there were things I didn't know at all, and things I'd only half-understood that I finally got a solid grasp on.&lt;/p&gt;

</description>
      <category>cks</category>
      <category>kubernetes</category>
      <category>kubestronaut</category>
      <category>ai</category>
    </item>
    <item>
      <title>Troubleshooting Kubernetes Events with TKE and Tencent Cloud CLS</title>
      <dc:creator>Tencent Cloud -Cloud Log Service</dc:creator>
      <pubDate>Mon, 15 Jun 2026 11:06:29 +0000</pubDate>
      <link>https://dev.to/tencentcloud-cls/troubleshooting-kubernetes-events-with-tke-and-tencent-cloud-cls-1ncl</link>
      <guid>https://dev.to/tencentcloud-cls/troubleshooting-kubernetes-events-with-tke-and-tencent-cloud-cls-1ncl</guid>
      <description>&lt;h1&gt;
  
  
  Troubleshooting Kubernetes Events with TKE and Tencent Cloud CLS
&lt;/h1&gt;

&lt;p&gt;Cluster problems rarely appear from nowhere. Before a service outage becomes visible, Kubernetes often records smaller state changes: node pressure, Pod scheduling, Pod eviction, and cluster autoscaler decisions.&lt;/p&gt;

&lt;p&gt;Tencent Kubernetes Engine can send those Events into Tencent Cloud CLS, where they become searchable logs and dashboard data. This gives operators a central way to answer what changed, when it changed, which object was involved, and which component reported it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an Event tells you
&lt;/h2&gt;

&lt;p&gt;Kubernetes Events describe state transitions. The useful fields are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What to look for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Normal&lt;/code&gt;, &lt;code&gt;Warning&lt;/code&gt;, or a custom type.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Involved Object&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pod, Deployment, Node, or another Kubernetes object.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Source&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Component such as Scheduler or Kubelet.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Reason&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Short, machine-readable reason code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Message&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detailed explanation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many times it happened.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The core flow is: Kubernetes emits a state-change record, CLS stores it as a log event, and the operator filters by object, component, reason, message, count, and timestamp.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Event Search
&lt;/h2&gt;

&lt;p&gt;In TKE, go to &lt;strong&gt;Cluster Operations -&amp;gt; Event Search&lt;/strong&gt;. CLS provides collection, storage, search, analysis, and dashboards for the event stream.&lt;/p&gt;

&lt;p&gt;Use the overview when you need warning distribution, affected object types, and event trends. Use global search when you already know the component or object name and need a row-level timeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runbook 1: an abnormal node
&lt;/h2&gt;

&lt;p&gt;Filter by the abnormal node name in the event overview. In this example, the result included a node disk-space warning.&lt;/p&gt;

&lt;p&gt;The timeline showed that on &lt;code&gt;2020-11-25&lt;/code&gt;, node &lt;code&gt;172.16.18.13&lt;/code&gt; became abnormal because disk space was insufficient. Kubelet then tried to evict Pods from the node to reclaim disk space.&lt;/p&gt;

&lt;p&gt;That sequence gives you a clean next step: check node disk usage, eviction thresholds, and workload placement before treating it as a generic application failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runbook 2: autoscaler expansion
&lt;/h2&gt;

&lt;p&gt;For node pool autoscaling, query the autoscaler component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;event.source.component:"cluster-autoscaler"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Display these fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;event.reason&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;event.message&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;event.involvedObject.name&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sort by log time descending. The result should work like a compact ledger of autoscaler decisions: workload object, reason, message, and the timestamp of each scaling step.&lt;/p&gt;

&lt;p&gt;The event stream showed scale-out around &lt;code&gt;2020-11-25 20:35:45&lt;/code&gt;, triggered by three nginx Pods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nginx-5dbf784b68-tq8rd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx-5dbf784b68-fpvbx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx-5dbf784b68-v9jv5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three nodes were added. Later scale-out did not continue because the node pool had reached its maximum node count.&lt;/p&gt;
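
&lt;p&gt;If you export the search results, the same filter-and-sort can be reproduced offline. The Python sketch below assumes each record is a nested dictionary mirroring the &lt;code&gt;event.*&lt;/code&gt; field paths used in the query. The &lt;code&gt;time&lt;/code&gt; key, the messages, and two of the three records are made up for illustration and are not real CLS output.&lt;/p&gt;

```python
# Made-up sample records shaped like the event.* fields queried above.
# The "time" key and every message are assumptions, not real CLS output.
records = [
    {"time": "2020-11-25 20:35:45", "event": {
        "source": {"component": "cluster-autoscaler"},
        "reason": "TriggeredScaleUp",
        "message": "sample: pod triggered scale-up",
        "involvedObject": {"name": "nginx-5dbf784b68-tq8rd"}}},
    {"time": "2020-11-25 20:30:10", "event": {
        "source": {"component": "kubelet"},
        "reason": "Evicted",
        "message": "sample: pod evicted",
        "involvedObject": {"name": "example-pod"}}},
    {"time": "2020-11-25 20:40:02", "event": {
        "source": {"component": "cluster-autoscaler"},
        "reason": "NotTriggerScaleUp",
        "message": "sample: node pool at maximum size",
        "involvedObject": {"name": "example-pending-pod"}}},
]


def autoscaler_timeline(rows, component="cluster-autoscaler"):
    """Filter by source component, newest first, keeping the display fields."""
    hits = [r for r in rows if r["event"]["source"]["component"] == component]
    hits.sort(key=lambda r: r["time"], reverse=True)
    return [(r["time"], r["event"]["reason"],
             r["event"]["involvedObject"]["name"], r["event"]["message"])
            for r in hits]


for row in autoscaler_timeline(records):
    print(row)
```

&lt;p&gt;Sorting on the timestamp string works here only because the sample uses a fixed-width, year-first format. With a different export format, parse the timestamps before sorting.&lt;/p&gt;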

&lt;h2&gt;
  
  
  Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use Events to understand state changes, not only current state.&lt;/li&gt;
&lt;li&gt;Start with overview dashboards, then filter by object name.&lt;/li&gt;
&lt;li&gt;For node issues, inspect reason, message, source component, and count.&lt;/li&gt;
&lt;li&gt;For autoscaling, query &lt;code&gt;cluster-autoscaler&lt;/code&gt; and reconstruct the event timeline.&lt;/li&gt;
&lt;li&gt;Use metrics and logs after Events point you to the right object and time window.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why not only use &lt;code&gt;kubectl describe&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;kubectl describe&lt;/code&gt; is useful for one object. CLS is better when you need searchable history, dashboards, and cross-object analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the fastest autoscaler query?
&lt;/h3&gt;

&lt;p&gt;Start with &lt;code&gt;event.source.component:"cluster-autoscaler"&lt;/code&gt; and sort by log time descending.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>logging</category>
      <category>devops</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
