<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pratheesh Satheesh Kumar</title>
    <description>The latest articles on DEV Community by Pratheesh Satheesh Kumar (@pratheesh_s).</description>
    <link>https://dev.to/pratheesh_s</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1554347%2Ffa5e7a03-28d2-4924-8e18-dc817410b239.jpg</url>
      <title>DEV Community: Pratheesh Satheesh Kumar</title>
      <link>https://dev.to/pratheesh_s</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pratheesh_s"/>
    <language>en</language>
    <item>
      <title>5 Things Brisbane DevOps Teams Must Do After Kubernetes v1.36 Drops spec.externalIPs</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Sun, 17 May 2026 12:29:31 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/5-things-brisbane-devops-teams-must-do-after-kubernetes-v136-drops-specexternalips-3b9a</link>
      <guid>https://dev.to/pratheesh_s/5-things-brisbane-devops-teams-must-do-after-kubernetes-v136-drops-specexternalips-3b9a</guid>
      <description>&lt;h2&gt;
  
  
  Your Kubernetes Clusters Just Got a Hard Deadline
&lt;/h2&gt;

&lt;p&gt;If your platform team is running Kubernetes in Brisbane — whether on-prem in a Fortitude Valley data centre, on EKS, AKS, or a bare-metal cluster out of a Queensland government facility — the v1.36 release has handed you a ticking clock.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.spec.externalIPs&lt;/code&gt; field for Kubernetes Services is now &lt;strong&gt;formally deprecated&lt;/strong&gt;. Not soft-deprecated. Not "we recommend avoiding it." Formally, officially, documented-in-the-changelog deprecated — with full removal from &lt;code&gt;kube-proxy&lt;/code&gt; and conformance criteria coming in a future minor release. That future release could be v1.37 or v1.38. The Kubernetes release cadence is roughly every four months. You do not have years.&lt;/p&gt;

&lt;p&gt;This matters because &lt;code&gt;.spec.externalIPs&lt;/code&gt; has been a known security liability since CVE-2020-8554 was published in 2020. Any cluster where users are not 100% trusted — which is every multi-team cluster, every shared platform, every SaaS product running on Kubernetes — is potentially vulnerable. The deprecation is the project drawing a hard line on "insecure by default" patterns it can no longer defend.&lt;/p&gt;

&lt;p&gt;Here is exactly what your team needs to do, starting today.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Audit Every Service in Every Cluster for spec.externalIPs Usage
&lt;/h2&gt;

&lt;p&gt;Before you do anything else, you need ground truth. Run this against every cluster your team manages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get services &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="s1"&gt;'.items[] | select(.spec.externalIPs != null and .spec.externalIPs != []) |
  {namespace: .metadata.namespace, name: .metadata.name, externalIPs: .spec.externalIPs}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're managing clusters across multiple environments — dev, staging, prod — run this in each context. If you use GitOps via ArgoCD or FluxCD, also grep your manifests in Git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'externalIPs'&lt;/span&gt; ./manifests/ &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'*.yaml'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Document every hit. That list is your migration backlog.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Enable the DenyServiceExternalIPs Admission Controller Immediately
&lt;/h2&gt;

&lt;p&gt;Kubernetes has shipped the &lt;code&gt;DenyServiceExternalIPs&lt;/code&gt; admission controller since v1.21 specifically to block new usage of this field. If your clusters do not have it enabled, enable it now — before your next sprint.&lt;/p&gt;

&lt;p&gt;For clusters you manage directly (kubeadm, Talos, RKE2), add &lt;code&gt;DenyServiceExternalIPs&lt;/code&gt; to your API server's &lt;code&gt;--enable-admission-plugins&lt;/code&gt; flag. For managed Kubernetes on EKS or GKE, check whether the control plane exposes this — if not, implement a validating webhook that rejects any Service with a non-empty &lt;code&gt;spec.externalIPs&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;This is not optional housekeeping. This stops the bleeding. No new usage means no new technical debt accumulating while you work through the existing backlog.&lt;/p&gt;
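
&lt;p&gt;For managed control planes that do not expose the plugin, the decision logic of the validating webhook mentioned above can be sketched roughly as follows. This is a minimal sketch in Python, assuming the standard &lt;code&gt;admission.k8s.io/v1&lt;/code&gt; AdmissionReview request and response shape. The function name &lt;code&gt;review_service&lt;/code&gt; is hypothetical, and the HTTPS server and the ValidatingWebhookConfiguration that a real webhook needs are left out.&lt;/p&gt;

```python
def review_service(admission_review: dict) -> dict:
    """Build an AdmissionReview response that denies Services using spec.externalIPs.

    Hypothetical helper: only the allow/deny decision is shown, not the HTTPS
    endpoint or the ValidatingWebhookConfiguration that routes requests to it.
    """
    request = admission_review.get("request", {})
    obj = request.get("object") or {}  # "object" is null on DELETE requests
    external_ips = (obj.get("spec") or {}).get("externalIPs") or []

    # The response must echo the request uid and carry an explicit allowed flag.
    response = {"uid": request.get("uid", ""), "allowed": not external_ips}
    if external_ips:
        response["status"] = {
            "code": 403,
            "message": "spec.externalIPs is deprecated and blocked on this cluster",
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

&lt;p&gt;Keeping the decision in a pure function like this makes it easy to unit test before wiring it into whatever web framework serves the webhook. An empty or missing &lt;code&gt;externalIPs&lt;/code&gt; list is allowed, so existing Services that never used the field are unaffected.&lt;/p&gt;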




&lt;h2&gt;
  
  
  3. Migrate Workloads to Supported Alternatives — Today, Not Next Quarter
&lt;/h2&gt;

&lt;p&gt;The reason the Kubernetes project felt comfortable making this a hard deprecation is that the ecosystem now has mature alternatives. Your migration path depends on your cluster type:&lt;/p&gt;

&lt;h3&gt;
  
  
  On bare-metal or on-prem clusters (common in Brisbane enterprise and government)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MetalLB&lt;/strong&gt;: a battle-hardened load-balancer implementation for non-cloud environments. Supports BGP and Layer 2 modes. Covers most &lt;code&gt;spec.externalIPs&lt;/code&gt; use cases once the Service is switched to &lt;code&gt;type: LoadBalancer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cilium Gateway API&lt;/strong&gt;: if you're already running Cilium as your CNI, the Gateway API implementation covers the same use cases with better security primitives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway API (SIG Network)&lt;/strong&gt;: the official successor to Ingress and the recommended path for new cluster networking going forward.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  On cloud-managed clusters (EKS, AKS, GKE)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;LoadBalancer&lt;/code&gt; type Services backed by your cloud provider's NLB or ALB. If you were using &lt;code&gt;spec.externalIPs&lt;/code&gt; to work around cost or complexity, that's a conversation worth having — the security tradeoff is not acceptable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For internal service-to-service routing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you were using &lt;code&gt;spec.externalIPs&lt;/code&gt; for internal routing hacks, migrate to ExternalName Services, headless Services, or proper ingress/gateway patterns.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Update Your IaC, Helm Charts, and GitOps Manifests
&lt;/h2&gt;

&lt;p&gt;Knowing you have a problem and fixing your running clusters is only half the job. If &lt;code&gt;spec.externalIPs&lt;/code&gt; is baked into a Helm chart — yours or a third-party one — the field will reappear on the next &lt;code&gt;helm upgrade&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit your Helm &lt;code&gt;values.yaml&lt;/code&gt; files and chart templates for &lt;code&gt;externalIPs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check upstream Helm charts you've vendored or depend on — open issues with maintainers if they haven't already removed the field.&lt;/li&gt;
&lt;li&gt;Update your ArgoCD/FluxCD ApplicationSets and Kustomize overlays.&lt;/li&gt;
&lt;li&gt;Add a CI check — a simple &lt;code&gt;grep&lt;/code&gt; in your pipeline — that fails any PR introducing &lt;code&gt;externalIPs&lt;/code&gt; into a manifest. Make the linter your enforcement mechanism before the admission controller becomes your last line of defence.&lt;/li&gt;
&lt;/ul&gt;
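
&lt;p&gt;The CI check in the last bullet could look something like the sketch below. It is written in Python rather than as a one-line &lt;code&gt;grep&lt;/code&gt; so that commented-out lines are ignored and both &lt;code&gt;.yaml&lt;/code&gt; and &lt;code&gt;.yml&lt;/code&gt; files are covered. The function names and the default &lt;code&gt;./manifests&lt;/code&gt; path are assumptions for illustration; adjust them to your repository layout.&lt;/p&gt;

```python
import re
from pathlib import Path

# Matches an externalIPs key at any indentation. Commented-out lines do not
# match because only whitespace or a list dash may precede the key.
PATTERN = re.compile(r"^\s*(-\s+)?externalIPs\s*:")


def find_external_ips(root: str) -> list[tuple[str, int]]:
    """Return (path, line number) for every externalIPs key under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in {".yaml", ".yml"} or not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for number, line in enumerate(lines, start=1):
            if PATTERN.match(line):
                hits.append((str(path), number))
    return hits


def main(argv: list[str]) -> int:
    """Print each hit and return a non-zero exit code when any are found."""
    root = argv[1] if len(argv) > 1 else "./manifests"
    found = find_external_ips(root)
    for path, number in found:
        print(f"{path}:{number}: externalIPs is deprecated; use a supported alternative")
    return 1 if found else 0

# In a pipeline step, call: raise SystemExit(main(sys.argv)) after importing sys.
```

&lt;p&gt;Because it scans text line by line, the check also works on Helm templates that are not valid YAML until rendered. The trade-off is that it cannot tell a Service apart from any other resource that happens to use the same key name.&lt;/p&gt;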




&lt;h2&gt;
  
  
  5. Communicate the Change Across Your Platform Tenants
&lt;/h2&gt;

&lt;p&gt;If you run a shared Kubernetes platform — common in Brisbane's growing platform engineering teams across industries like resources, financial services, and state government — your tenants need to know this is coming.&lt;/p&gt;

&lt;p&gt;Send a platform notice this week. Include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;code&gt;spec.externalIPs&lt;/code&gt; is and why it's being removed (link to CVE-2020-8554)&lt;/li&gt;
&lt;li&gt;The timeline: deprecated now, removed in a future minor release&lt;/li&gt;
&lt;li&gt;The approved alternative patterns your platform supports&lt;/li&gt;
&lt;li&gt;A deadline for tenant teams to migrate their workloads&lt;/li&gt;
&lt;li&gt;Who to contact for migration support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't frame this as "Kubernetes is making a change." Frame it as "here's what your platform team is doing to keep your workloads secure, and here's what we need from you."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture for Brisbane Platform Teams
&lt;/h2&gt;

&lt;p&gt;The deprecation of &lt;code&gt;spec.externalIPs&lt;/code&gt; is a signal, not just housekeeping. The Kubernetes project is actively unwinding "insecure by default" decisions made in the early days of the ecosystem. More removals are likely to follow; PodSecurityPolicy, removed in v1.25, was just the beginning.&lt;/p&gt;

&lt;p&gt;Brisbane DevOps and SRE teams that treat deprecation notices as immediate action items, not future-sprint backlog items, are the ones whose platforms stay stable when the removal actually lands. The teams that ignore them are the ones getting paged at 2am when &lt;code&gt;kube-proxy&lt;/code&gt; stops routing traffic for a Service that still relies on a field it no longer honours.&lt;/p&gt;

&lt;p&gt;Audit today. Migrate this sprint. Enforce via admission control and CI. That's the playbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is spec.externalIPs in Kubernetes and why is it being removed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;spec.externalIPs is a Service field that lists additional IP addresses on which the Service accepts traffic. It is being deprecated, and later removed, because it was designed on the assumption that all cluster users are fully trusted, which does not hold in multi-team clusters. CVE-2020-8554 documented how the field can be abused to intercept traffic, and the Kubernetes project formally deprecated it in v1.36 ahead of full removal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When will spec.externalIPs actually stop working in Kubernetes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The field is formally deprecated in v1.36 (released May 2026). Full removal from kube-proxy and conformance criteria will happen in a future minor release — likely v1.37 or v1.38. Given Kubernetes releases roughly every four months, teams should treat this as a 4-8 month window at most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can I use instead of spec.externalIPs for on-prem or bare-metal Kubernetes clusters?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MetalLB is the most widely adopted replacement for bare-metal and on-prem clusters, supporting both Layer 2 and BGP modes. The Kubernetes Gateway API (managed by SIG Network) is the strategic long-term path. If you're already running Cilium as your CNI, its Gateway API implementation is another strong option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I find out if my cluster is using spec.externalIPs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;kubectl get services --all-namespaces -o json | jq '.items[] | select(.spec.externalIPs != null and .spec.externalIPs != [])'&lt;/code&gt; against each cluster context. Also grep your GitOps manifests and Helm chart templates for the string &lt;code&gt;externalIPs&lt;/code&gt; to catch usage baked into IaC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the DenyServiceExternalIPs admission controller and should I enable it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DenyServiceExternalIPs is a built-in Kubernetes admission controller that rejects any Service creation or update that includes a non-empty spec.externalIPs field. It has been available since v1.21. Yes — enable it on every cluster now. It prevents new usage from being introduced while you migrate existing workloads, and it functions as a policy enforcement layer ahead of the full removal.&lt;/p&gt;

</description>
      <category>kubernetesv136</category>
      <category>specexternalipsdeprecation</category>
      <category>kubernetesdeprecationbrisbane</category>
      <category>devopsbrisbane</category>
    </item>
    <item>
      <title>Why Agentic AI Changes Everything for Brisbane DevOps Teams — An Infrastructure Perspective</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Sun, 17 May 2026 12:29:19 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/why-agentic-ai-changes-everything-for-brisbane-devops-teams-an-infrastructure-perspective-1ml1</link>
      <guid>https://dev.to/pratheesh_s/why-agentic-ai-changes-everything-for-brisbane-devops-teams-an-infrastructure-perspective-1ml1</guid>
      <description>&lt;h2&gt;
  
  
  Why Agentic AI Changes Everything for Brisbane DevOps Teams — An Infrastructure Perspective
&lt;/h2&gt;

&lt;p&gt;Agentic AI is not a smarter chatbot. It is a fundamentally different workload — and if your current infrastructure stack was designed around traditional LLM inference, you are already behind. The AMD and Red Hat announcement of May 2026 makes this explicit: agentic AI systems that reason, plan, execute multistep tasks, and call external tools continuously require a new infrastructure stack from the ground up. For Brisbane DevOps and platform engineering teams running Kubernetes-based workloads on OpenShift, EKS, or GKE, this is not a future problem. It is a now problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AMD and Red Hat Draw a Hard Line
&lt;/h3&gt;

&lt;p&gt;Red Hat's May 2026 blog post, published in partnership with AMD, makes a clear architectural argument: traditional inference infrastructure — optimised for single-shot, stateless model calls — is structurally mismatched to agentic workloads. Agents call models repeatedly in loops, fan out across tools and data sources, maintain state across multistep reasoning chains, and must run continuously and cost-effectively at scale.&lt;/p&gt;

&lt;p&gt;The implication is direct. The infrastructure stack that got your team through the first wave of LLM integration — a GPU node here, a FastAPI wrapper there, maybe a Helm chart for a model server — will not survive contact with production agentic workloads. Red Hat is positioning OpenShift AI alongside AMD Instinct accelerators as the answer, but the deeper message is architectural, not vendor-specific.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters for DevOps and Platform Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Agents Are a Continuous Workload, Not a Batch Job
&lt;/h3&gt;

&lt;p&gt;The single most disruptive characteristic of agentic AI for infrastructure teams is continuity. Traditional inference is stateless and bursty — a request comes in, a model responds, the compute is released. Agents are stateful and long-running. They maintain context, hold tool call queues, and re-enter reasoning loops. This breaks standard autoscaling assumptions built into KEDA and Kubernetes HPA, because the signal for "busy" is no longer just QPS — it is reasoning depth, tool latency, and memory pressure across multi-turn sessions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability Gaps Become Critical Failures
&lt;/h3&gt;

&lt;p&gt;Your existing Prometheus and Grafana dashboards were not built for agent trace visibility. An agent that silently loops, fails a tool call, or degrades in reasoning quality will not trigger a CPU alert. OpenTelemetry is the right foundation here, but instrumentation must extend into the agent framework layer — LangChain, LlamaIndex, or AutoGen — capturing span-level tool invocations, token consumption per reasoning step, and inter-agent message latency. Without this, your SRE team is flying blind on a workload that can burn GPU budget in minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitOps and IaC Assumptions Break at the Model Layer
&lt;/h3&gt;

&lt;p&gt;Teams managing infrastructure with ArgoCD and Terraform are accustomed to declarative, idempotent deployments. Agent systems introduce non-determinism at the application layer that bleeds into infrastructure decisions. Model versions, context window sizes, and tool configurations all affect infrastructure sizing — and they change frequently. Your IaC pipeline needs to treat model configuration as a first-class infrastructure variable, not an application concern left to the data science team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Impact on Brisbane Teams
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Local Talent and Tooling Gap
&lt;/h3&gt;

&lt;p&gt;Brisbane's DevOps market is deep in Kubernetes and cloud-native tooling, but MLOps and LLMOps expertise remains genuinely scarce. According to Seek data from Q1 2026, ML/AI engineering roles in Queensland were advertised at three times the rate of available candidates. The practical consequence: most Brisbane platform teams will be asked to run agentic AI infrastructure before they have hired anyone with production LLMOps experience. That means the platform engineering team owns this problem whether they signed up for it or not.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenShift Shops Have a Head Start
&lt;/h3&gt;

&lt;p&gt;Brisbane enterprises running Red Hat OpenShift — common across Queensland government, resources, and financial services — have a concrete advantage here. The AMD and Red Hat integration ships with OpenShift AI operator support, meaning the serving layer, model registry, and pipeline orchestration are available within an existing GitOps-managed cluster. For teams already using ArgoCD and OpenShift Pipelines, the onboarding path to a production-grade agentic AI stack is shorter than starting from raw EKS.&lt;/p&gt;

&lt;h3&gt;
  
  
  FinOps Complexity Multiplies
&lt;/h3&gt;

&lt;p&gt;Agentic workloads have unpredictable costs by design. A single agent session can trigger dozens of model calls, each with a variable token count. Brisbane teams with mature FinOps practices (tagging strategies in AWS or Azure, budget alerts, Spot instance policies) will need to extend those controls to cover per-agent-session cost attribution. Without this, the first production agentic workload will blow a cloud budget in a way that is genuinely difficult to explain to a CFO.&lt;/p&gt;
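
&lt;p&gt;Per-session cost attribution can start as a simple roll-up of model-call records keyed by a session identifier. The sketch below assumes each call is logged with a &lt;code&gt;session_id&lt;/code&gt; and its token counts; the record type, field names, and per-1,000-token prices are all hypothetical and would come from your own telemetry and your provider's pricing.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCall:
    """One model invocation, as it might be recorded by your telemetry (hypothetical shape)."""
    session_id: str
    input_tokens: int
    output_tokens: int


def cost_per_session(
    calls: list[ModelCall],
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> dict[str, float]:
    """Sum token spend per agent session from individual model-call records."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.session_id] += (
            call.input_tokens / 1000 * input_price_per_1k
            + call.output_tokens / 1000 * output_price_per_1k
        )
    return dict(totals)
```

&lt;p&gt;The same roll-up can feed a budget alert per session, which is the agentic equivalent of a per-tag budget in AWS or Azure. GPU time for self-hosted models would need a separate allocation rule that this sketch does not cover.&lt;/p&gt;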

&lt;h2&gt;
  
  
  What To Do Now
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Audit Your Current Stack Against Agentic Requirements
&lt;/h3&gt;

&lt;p&gt;This week, map your existing infrastructure against three agentic AI requirements: stateful workload support (do your Kubernetes configs handle long-running sessions?), deep observability (does your OpenTelemetry instrumentation reach the model and tool call layer?), and cost attribution (can you tag and track spend per agent session?). Identify the gaps before a project lands on your backlog.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instrument One Agent Framework End-to-End
&lt;/h3&gt;

&lt;p&gt;Pick one agent framework (LangChain is the most common entry point) and instrument it fully with OpenTelemetry, exporting spans to a tracing backend that your existing Grafana stack can query, such as Grafana Tempo. This gives your SRE team a realistic picture of what agentic workload visibility actually looks like before you are under production pressure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Treat Model Config as Infrastructure
&lt;/h3&gt;

&lt;p&gt;Add model version, context window size, and tool manifest to your Terraform or Helm values files today. Enforce review through your existing GitOps pipeline. This costs almost nothing now and prevents a class of production incidents later.&lt;/p&gt;
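
&lt;p&gt;One lightweight way to enforce this in review is a pipeline step that rejects values files missing the model settings. The sketch below assumes the values have already been parsed into a dictionary with a top-level &lt;code&gt;model&lt;/code&gt; mapping; the key names &lt;code&gt;modelVersion&lt;/code&gt;, &lt;code&gt;contextWindowTokens&lt;/code&gt;, and &lt;code&gt;tools&lt;/code&gt; are hypothetical and should match whatever your charts actually use.&lt;/p&gt;

```python
# Hypothetical schema: required key name mapped to its expected Python type.
REQUIRED_KEYS = {
    "modelVersion": str,
    "contextWindowTokens": int,
    "tools": list,
}


def validate_model_config(values: dict) -> list[str]:
    """Return a list of problems found in the model section of parsed values."""
    model = values.get("model")
    if not isinstance(model, dict):
        return ["values must contain a 'model' mapping"]

    errors = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in model:
            errors.append(f"model.{key} is missing")
        elif not isinstance(model[key], expected):
            errors.append(f"model.{key} must be of type {expected.__name__}")
    return errors
```

&lt;p&gt;An empty list means the file passes. Failing the pipeline on a non-empty list puts model changes through the same pull request review as any other infrastructure change.&lt;/p&gt;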

&lt;h3&gt;
  
  
  Engage the AMD-Red Hat Stack Specifically
&lt;/h3&gt;

&lt;p&gt;If your team runs OpenShift, the OpenShift AI operator with AMD Instinct GPU support is worth a proof-of-concept this quarter. The integration is mature enough for non-production validation and the architectural patterns it enforces — model serving via vLLM, pipeline orchestration via Kubeflow Pipelines — are transferable even if you later diverge from the vendor stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between agentic AI and traditional LLM inference infrastructure?
&lt;/h3&gt;

&lt;p&gt;Traditional LLM inference handles stateless, single-turn requests — one input, one output, compute released. Agentic AI infrastructure must support stateful, multistep reasoning loops where a model calls tools, processes results, re-enters reasoning, and maintains context across many model invocations in a single session. This requires different autoscaling, observability, and cost management approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Kubernetes support agentic AI workloads out of the box?
&lt;/h3&gt;

&lt;p&gt;Not without modification. The Horizontal Pod Autoscaler, and add-ons such as KEDA that build on it, are typically driven by resource, request, or event-based signals. Agentic workloads need custom metrics such as reasoning session depth, tool call queue length, and memory pressure to scale correctly. Teams need to instrument their agent frameworks with OpenTelemetry and expose those signals through the Kubernetes custom or external metrics APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does OpenShift AI fit into an agentic AI infrastructure stack?
&lt;/h3&gt;

&lt;p&gt;OpenShift AI provides a Kubernetes-native operator stack covering model serving (via vLLM), model registry, pipeline orchestration (Kubeflow Pipelines), and GPU resource management. For teams already running OpenShift with ArgoCD-based GitOps, it extends the existing declarative infrastructure model to the AI serving layer, reducing the operational gap between platform engineering and ML workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  What should Brisbane DevOps teams prioritise first when preparing for agentic AI?
&lt;/h3&gt;

&lt;p&gt;Observability first. Before you optimise compute or redesign autoscaling, instrument your agent framework with OpenTelemetry so you can see what is actually happening inside a running agent session. Blind optimisation of agentic infrastructure is expensive and slow. Visibility gives your team the data to make every subsequent decision correctly.&lt;/p&gt;


</description>
      <category>agenticai</category>
      <category>infrastructurestack</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Why Google + Wiz Changes Everything for Brisbane Multicloud Teams — A DevSecOps Perspective</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Sun, 17 May 2026 11:46:40 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/why-google-wiz-changes-everything-for-brisbane-multicloud-teams-a-devsecops-perspective-4oea</link>
      <guid>https://dev.to/pratheesh_s/why-google-wiz-changes-everything-for-brisbane-multicloud-teams-a-devsecops-perspective-4oea</guid>
      <description>&lt;h2&gt;
  
  
  The Acquisition Nobody in Brisbane DevOps Should Ignore
&lt;/h2&gt;

&lt;p&gt;Google has completed its acquisition of Wiz — and if you're running multicloud infrastructure in Brisbane right now, your security architecture just got more complicated, not simpler.&lt;/p&gt;

&lt;p&gt;At RSA Conference 2026, Google Cloud's Office of the CISO framed this as a gift for CISOs navigating fragmented cloud environments. And technically, they're right. But for platform engineers, SREs, and DevSecOps leads on the ground in Queensland, the real question isn't whether the deal is good for Google — it's whether it reshapes the security tooling decisions you're making today across AWS, GCP, Azure, and OCI.&lt;/p&gt;

&lt;p&gt;The answer is yes. And the implications run deeper than most vendor announcements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Google + Wiz Actually Changes
&lt;/h2&gt;

&lt;p&gt;Wiz built its reputation as the cloud-native security platform that didn't care which cloud you used. It sat above your infrastructure, correlated risks across environments, and gave security teams a single graph of exposure across AWS, Azure, and GCP. That vendor-neutrality was the product.&lt;/p&gt;

&lt;p&gt;Now Google owns it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Vendor Neutrality Question
&lt;/h3&gt;

&lt;p&gt;This is the first thing Brisbane DevSecOps teams need to pressure-test. Wiz's value proposition was built on independence. Enterprise customers — particularly those running genuine multicloud strategies across AWS-primary workloads with GCP for analytics or Azure for Microsoft integration — chose Wiz &lt;em&gt;because&lt;/em&gt; it had no cloud allegiance.&lt;/p&gt;

&lt;p&gt;With Google as the parent, three scenarios are plausible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best case:&lt;/strong&gt; Google maintains Wiz as a genuinely neutral platform, using it to win trust across non-GCP workloads and expand overall market share&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middle case:&lt;/strong&gt; Wiz deepens GCP integration first, creating subtle feature advantages on Google Cloud that erode parity over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worst case:&lt;/strong&gt; Wiz becomes a GCP acquisition funnel — excellent security visibility that happens to surface migration recommendations toward Google Cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No Brisbane platform team should assume best case without watching the roadmap closely over the next 12 months.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your Current Security Tooling Stack
&lt;/h2&gt;

&lt;p&gt;If Wiz is already in your stack — and many Queensland enterprise and government-adjacent organisations use it — you're not in immediate danger. But you are at a strategic inflection point.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitOps and IaC Security Pipelines
&lt;/h3&gt;

&lt;p&gt;Wiz's CNAPP capabilities integrate with CI/CD pipelines through IaC scanning — catching misconfigurations in Terraform, Helm charts, and Kubernetes manifests before they reach production. This is where the Google acquisition gets genuinely interesting for platform engineering teams.&lt;/p&gt;

&lt;p&gt;Google's investment in developer tooling (Cloud Code, Cloud Build, Artifact Registry) combined with Wiz's shift-left security scanning could produce a tightly integrated DevSecOps workflow, but one that's optimised for GCP-native pipelines. If your CI/CD runs on GitHub Actions deploying to EKS or AKS, watch whether Wiz's pipeline integrations remain equally maintained.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes and Container Security
&lt;/h3&gt;

&lt;p&gt;Wiz's Kubernetes security posture management is a genuine differentiator. For Brisbane teams running EKS, AKS, GKE, or OpenShift, the platform's ability to correlate container-level risks with cloud identity and network exposure is exactly what mature DevSecOps looks like.&lt;/p&gt;

&lt;p&gt;Post-acquisition, expect GKE integration to deepen fastest. Teams running non-GKE Kubernetes should explicitly ask Google's Wiz team about parity commitments — and get them in writing in any enterprise agreements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Competitive Shift in Multicloud Security Tooling
&lt;/h2&gt;

&lt;p&gt;Before this acquisition, the CNAPP market had several major players not owned by a hyperscaler: Wiz, Orca Security, Lacework (acquired by Fortinet in 2024), and Palo Alto Networks' Prisma Cloud. Now the largest of them is Google-owned.&lt;/p&gt;

&lt;p&gt;For Brisbane organisations evaluating security tooling in 2026, this changes the vendor landscape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orca Security&lt;/strong&gt; becomes the obvious alternative for teams prioritising genuine cloud-neutrality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prisma Cloud (Palo Alto)&lt;/strong&gt; gains narrative momentum as the independent enterprise CNAPP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Security Hub + GuardDuty&lt;/strong&gt; becomes more attractive for AWS-primary shops that don't want Google-owned tooling in their security data pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source alternatives&lt;/strong&gt; (Falco, Trivy, OpenSCAP in combination) become worth reconsidering for cost-conscious platform teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One Action Brisbane Platform Teams Should Take This Week
&lt;/h2&gt;

&lt;p&gt;Before your next architecture review, map every Wiz integration point in your current stack and answer three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Which clouds does Wiz monitor for us, and are any non-GCP environments in scope?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does our current Wiz contract include data residency commitments, and does Google's ownership change our assessment?&lt;/strong&gt; (Particularly relevant for Queensland government and health sector workloads)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What would our security posture look like if we needed to replace Wiz in 18 months?&lt;/strong&gt; Running a lightweight parallel evaluation of one alternative CNAPP is cheap insurance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't about abandoning Wiz. It's about making a deliberate choice rather than drifting into vendor lock-in by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture for Brisbane's DevSecOps Maturity
&lt;/h2&gt;

&lt;p&gt;The Google + Wiz deal signals something broader: the era of neutral cloud security tooling is compressing. The hyperscalers are now buying the independent security layer, having already pulled monitoring and SIEM into their own platforms (think AWS with CloudWatch, Microsoft with Sentinel).&lt;/p&gt;

&lt;p&gt;For Brisbane DevSecOps teams building long-term platform engineering capability, the strategic lesson isn't about Wiz specifically — it's about designing security architectures that don't create single points of vendor dependency at the CNAPP layer.&lt;/p&gt;

&lt;p&gt;Observability stacks learned this lesson with the OpenTelemetry movement. Security tooling is next.&lt;/p&gt;

&lt;p&gt;Build your GitOps pipelines and IaC security gates with pluggable controls. Use Wiz (or any CNAPP) as an aggregation and correlation layer, not as the only gate between your developers and production. That way, when the next acquisition reshapes your tooling landscape — and it will — your platform team adapts without a crisis.&lt;/p&gt;
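
&lt;p&gt;One way to picture "pluggable controls" is a small interface that every security gate in the pipeline implements. The sketch below is a minimal illustration in Go, not any vendor's API: the names Gate, Finding, RunGates and denyLatestTag are hypothetical, and a real gate would wrap a scanner such as Trivy, OPA or a CNAPP client.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// Finding is one issue reported by a security control.
type Finding struct {
	Control string // which control raised it
	Message string
}

// Gate is the hypothetical plug-in point: any scanner can sit behind it
// as long as it returns findings for an artifact.
type Gate interface {
	Name() string
	Check(artifact string) []Finding
}

// denyLatestTag is a toy gate that rejects images tagged ":latest".
type denyLatestTag struct{}

func (denyLatestTag) Name() string { return "deny-latest-tag" }

func (g denyLatestTag) Check(artifact string) []Finding {
	if strings.HasSuffix(artifact, ":latest") {
		return []Finding{{Control: g.Name(), Message: "image tag is not pinned"}}
	}
	return nil
}

// RunGates runs every configured gate and collects all findings, so
// swapping or removing one gate never disables the others.
func RunGates(gates []Gate, artifact string) []Finding {
	var all []Finding
	for _, g := range gates {
		all = append(all, g.Check(artifact)...)
	}
	return all
}

func main() {
	gates := []Gate{denyLatestTag{}}
	for _, f := range RunGates(gates, "registry.example.com/app:latest") {
		fmt.Printf("%s: %s\n", f.Control, f.Message)
	}
}
```

&lt;p&gt;Because the pipeline depends only on the interface, replacing one CNAPP with another means writing a new Gate implementation rather than redesigning the pipeline.&lt;/p&gt;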

&lt;p&gt;The Brisbane organisations that treat this acquisition as a prompt to audit their security architecture dependencies will be better positioned than those that wait for Google to make the decision for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Does the Google acquisition of Wiz mean Wiz will stop supporting AWS and Azure environments?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not immediately — Google has indicated Wiz will continue as a multicloud security platform. However, Brisbane teams should monitor whether GCP integrations receive preferential feature development over time and build vendor evaluation checkpoints into their annual security architecture reviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should Brisbane organisations using Wiz be worried about their data being accessible to Google?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This depends on your workload classification. Organisations in Queensland government, health, or financial services should review existing Wiz data processing agreements in light of Google's ownership and confirm whether data residency and sovereignty commitments remain unchanged under the new parent entity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does the Google + Wiz deal affect CI/CD pipeline security for DevSecOps teams?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Wiz's IaC scanning and shift-left capabilities will likely integrate more deeply with Google Cloud Build and Cloud Code over time. Teams running CI/CD pipelines on GitHub Actions, GitLab, or Jenkins deploying to non-GCP environments should explicitly validate that Wiz's pipeline integrations remain equally supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the best Wiz alternatives for multicloud security if we want to stay cloud-neutral?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Orca Security, Prisma Cloud (Palo Alto Networks), and Lacework (now part of Fortinet) are the primary enterprise CNAPP alternatives. For platform teams comfortable with open-source tooling, combining Trivy for container scanning, Falco for runtime security, and OPA for policy enforcement provides a vendor-neutral alternative stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this acquisition change how Brisbane companies should approach their multicloud strategy?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes — it reinforces the importance of designing security architectures with pluggable, interchangeable controls rather than deep dependency on any single CNAPP vendor. The acquisition accelerates a trend where hyperscalers own the security layer, making architectural flexibility a competitive advantage for platform teams.&lt;/p&gt;

</description>
      <category>googlewizacquisition</category>
      <category>multicloudsecurity</category>
      <category>devsecopsbrisbane</category>
      <category>cnappplatformengineering</category>
    </item>
    <item>
      <title>The Brisbane First-Home Buyers Caught in the Middle of an Overheated Starter Market</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Sun, 17 May 2026 11:44:43 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/the-brisbane-first-home-buyers-caught-in-the-middle-of-an-overheated-starter-market-1h3j</link>
      <guid>https://dev.to/pratheesh_s/the-brisbane-first-home-buyers-caught-in-the-middle-of-an-overheated-starter-market-1h3j</guid>
      <description>&lt;h2&gt;
  
  
  Brisbane's Under-$1M Market Is No Longer a Safe Harbour — It's a Battleground
&lt;/h2&gt;

&lt;p&gt;For years, the sub-$1M price bracket in Brisbane was the realistic entry point. The place where first-home buyers could compete without facing off against cashed-up investors or interstate upsizers. That bracket still exists — but right now, in May 2026, it has become the most contested segment in South-East Queensland property.&lt;/p&gt;

&lt;p&gt;Gen Z buyers — those born between 1997 and 2012 — have arrived in the market with urgency, and the numbers are real. Agents are reporting unsolicited calls flooding in from homeowners who haven't even listed, competition among buyers at open homes is intensifying, and properties that would have sat for weeks are clearing in days. The word being used by people who do this for a living? Overheated.&lt;/p&gt;

&lt;p&gt;So what does that actually mean for a 26-year-old in Chermside trying to buy their first home? Or a couple in Wynnum who've been saving since 2023? It means the game has changed — and the margin for error just got a lot thinner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gen Z Is Hitting the Market Hard Right Now
&lt;/h2&gt;

&lt;p&gt;This isn't spontaneous. Several forces have converged to push younger Australians into purchase mode at the same time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rate relief is real&lt;/strong&gt;: After a prolonged cycle of rate rises that sidelined many would-be buyers, recent cash rate reductions have restored borrowing capacity. For a Gen Z buyer earning $90,000, even a 0.5 percentage point rate reduction can mean an additional $30,000–$40,000 in borrowing power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rental pain has hit a breaking point&lt;/strong&gt;: Brisbane rents have climbed sharply since 2022. For many Gen Z residents, a mortgage repayment on a modest home is now within reach of — or even below — what they're paying in rent. Buying stops feeling aspirational and starts feeling logical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FOMO is structural, not emotional&lt;/strong&gt;: This generation watched older millennials get priced out of Sydney and Melbourne. They're not waiting to see what happens in Brisbane. They're moving.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parental support is accelerating timelines&lt;/strong&gt;: The 'Bank of Mum and Dad' is increasingly active in Queensland, giving younger buyers gifted deposits that compress what might have been a 5-year savings journey into 18 months.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Cost of an Overheated Entry Market
&lt;/h2&gt;

&lt;p&gt;Here's what doesn't make the headline: when the starter home segment overheats, it doesn't just hurt buyers — it reshapes the entire financial plan they built to get there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-Approvals Become a Moving Target
&lt;/h3&gt;

&lt;p&gt;A buyer who received pre-approval three months ago based on a purchase price of $750,000 may now find that the homes in that range look nothing like what they expected. Compromise is happening at the sharp end — smaller blocks, older stock, less desirable pockets. That's not just a lifestyle adjustment. It can affect lender valuations, which affects how much you can actually borrow against the property.&lt;/p&gt;

&lt;h3&gt;
  
  
  Emotional Decision-Making Drives Financial Risk
&lt;/h3&gt;

&lt;p&gt;Competitive markets create time pressure. Time pressure creates emotional decisions. Buyers who skip building and pest inspections, waive cooling-off periods, or stretch their budget to 'win' a property are taking on risks that will follow them for decades. A $20,000 overbid to secure a home today could mean years of financial stress if unexpected repairs emerge or life circumstances change.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Deposit Gap Widens Faster Than Savings Can Keep Up
&lt;/h3&gt;

&lt;p&gt;For buyers who aren't ready yet — who are still 12 to 18 months from a sufficient deposit — an overheated market is running away from them in real time. Every month of price growth in the sub-$1M segment increases the deposit required under standard LVR thresholds and compounds the Lenders Mortgage Insurance exposure for those without a 20% buffer.&lt;/p&gt;
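
&lt;p&gt;To see how quickly the target can move, here is a rough sketch of the arithmetic in Go. Every input (the $750,000 price, 0.5% monthly growth, $2,500 in monthly savings) is an illustrative assumption rather than market data, and the function names are invented for this example.&lt;/p&gt;

```go
package main

import "fmt"

// requiredDeposit returns the deposit needed to stay at an 80% LVR,
// that is, 20% of the purchase price.
func requiredDeposit(price float64) float64 {
	return price * 0.20
}

// priceAfter compounds a monthly growth rate over a number of months.
func priceAfter(price, monthlyGrowth float64, months int) float64 {
	for i := 0; i != months; i++ {
		price *= 1 + monthlyGrowth
	}
	return price
}

func main() {
	// All inputs are assumptions for illustration only.
	price := 750000.0
	monthlyGrowth := 0.005 // 0.5% per month
	monthlySavings := 2500.0
	months := 12

	today := requiredDeposit(price)
	later := requiredDeposit(priceAfter(price, monthlyGrowth, months))
	saved := monthlySavings * float64(months)

	fmt.Printf("20%% deposit today: $%.0f\n", today)
	fmt.Printf("20%% deposit in %d months: $%.0f\n", months, later)
	fmt.Printf("target moved by $%.0f while savings grew by $%.0f\n", later-today, saved)
}
```

&lt;p&gt;Comparing the last two figures shows whether savings are closing the gap or merely keeping pace. A broker's assessment would also factor in LMI premiums, which this sketch leaves out.&lt;/p&gt;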

&lt;h2&gt;
  
  
  What Brisbane First-Home Buyers Should Do Right Now
&lt;/h2&gt;

&lt;p&gt;If you're currently in the market, or planning to enter within the next six months, here's what matters today — not eventually, not 'when things settle down':&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Get a full finance assessment this week, not a rule-of-thumb estimate.&lt;/strong&gt; Know exactly what you can borrow, what conditions apply, and what your buffer looks like. Markets like this punish buyers who discover their limits at the offer stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand your LMI exposure and whether it's worth paying.&lt;/strong&gt; In a rising market, paying LMI to enter 12 months earlier can cost less than the additional deposit you'd need to save while prices move. Run the numbers honestly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand your search zone deliberately, not desperately.&lt;/strong&gt; Identify suburbs where the fundamentals are strong but competition is slightly less frenzied, such as Zillmere and Nudgee within about 15km of the CBD, or parts of Redcliffe further out in Moreton Bay, before the crowd catches up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lock in your pre-approval and keep it current.&lt;/strong&gt; Pre-approvals typically expire after 90 days. In a fast-moving market, an expired pre-approval at the moment of purchase can cost you the property.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talk to a mortgage broker before you talk to an agent.&lt;/strong&gt; Knowing your ceiling before you walk through an open home is the single most protective thing a first-home buyer can do in an overheated market.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bigger Picture for Brisbane Property in 2026
&lt;/h2&gt;

&lt;p&gt;Brisbane's sub-$1M market becoming 'overheated' is not a surprise to anyone watching the fundamentals — it's the logical outcome of population growth, constrained supply, rental market pressure, and recovering borrowing capacity converging at the same time.&lt;/p&gt;

&lt;p&gt;For Gen Z buyers, the window to enter the market at a starter price point is real but not infinite. For those watching from the sidelines, waiting for the market to cool before engaging carries its own risk. The buyers who act with the best financial preparation — not the most optimism — are the ones who'll look back on 2026 as the year they got in.&lt;/p&gt;

&lt;p&gt;If you're a Brisbane first-home buyer trying to make sense of this market, the smartest next step isn't scrolling more listings. It's getting clear on your numbers — today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why is Brisbane's property market so competitive for first-home buyers right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A combination of factors has hit at once in 2026: recent interest rate reductions have restored borrowing power, Brisbane rents have risen to the point where mortgage repayments are comparable, and Gen Z buyers who feared being priced out are entering the market en masse. The under-$1M bracket — the traditional entry point — is bearing the full weight of that demand with limited stock to absorb it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should I still try to buy a first home in Brisbane if the market is 'overheated'?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That depends entirely on your personal financial position, not the market headline. If your deposit is solid, your borrowing capacity is clear, and you have a financial buffer for the unexpected, entering the market in a competitive environment can still make long-term sense — particularly if you're currently paying high rent. Waiting for a 'cooler' market carries its own cost if prices continue to rise while you save. Get a proper finance assessment before deciding either way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Lenders Mortgage Insurance (LMI) work for first-home buyers in Brisbane?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LMI is charged when a buyer borrows more than 80% of a property's value — in other words, when the deposit is less than 20%. In Brisbane's current market, some first-home buyers are weighing up whether paying LMI now to enter the market earlier is financially smarter than waiting until they've saved a full 20% deposit, particularly if property prices continue rising. The calculation depends on your specific situation and should be worked through with a mortgage broker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a pre-approval and why does it matter so much in a competitive property market?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pre-approval is a conditional indication from a lender of how much they're willing to lend you, based on your financial situation. In a competitive market like Brisbane's current sub-$1M segment, having a current pre-approval means you can move quickly when a property comes up — without the delay of starting a finance application from scratch. Pre-approvals typically last 90 days and need to be renewed. Buyers without one often lose out to buyers who are already finance-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which Brisbane suburbs are still affordable for first-home buyers in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the inner-ring and middle suburbs have seen the sharpest competition, some areas still offer entry-level opportunities with strong fundamentals. Suburbs in the northern and eastern corridors — including parts of Zillmere, Nudgee, Redcliffe, and the Moreton Bay fringe — have historically offered more accessibility. However, these areas are also attracting increased attention as buyers broaden their search, so conditions can change quickly. A mortgage broker familiar with Brisbane can help you understand what price point is realistic for your borrowing capacity and which areas suit your situation.&lt;/p&gt;

</description>
      <category>brisbanefirsthomebuyers</category>
      <category>genzpropertybuyersbrisbane</category>
      <category>brisbanestarterhomes2026</category>
      <category>overheated</category>
    </item>
    <item>
      <title>How Kubernetes v1.36 Fixes the Horizontal Controller Scaling Problem</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Wed, 13 May 2026 08:01:02 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/how-kubernetes-v136-fixes-the-horizontal-controller-scaling-problem-4619</link>
      <guid>https://dev.to/pratheesh_s/how-kubernetes-v136-fixes-the-horizontal-controller-scaling-problem-4619</guid>
      <description>&lt;h2&gt;
  
  
  The Real Cost of Watching Everything
&lt;/h2&gt;

&lt;p&gt;Run a horizontally scaled controller across a large Kubernetes cluster: say, three replicas of a custom resource controller in a 5,000-node cluster. Each replica receives the &lt;em&gt;entire&lt;/em&gt; event stream from the API server. It deserializes every Pod, every ConfigMap, every change, keeps the roughly one third it owns, and discards the other two thirds. Multiply that waste by dozens of controllers and you're burning CPU and bandwidth on work that never had to happen.&lt;/p&gt;

&lt;p&gt;This is the scaling wall that Kubernetes operators hit, and Kubernetes v1.36 finally addresses it head-on with &lt;strong&gt;server-side sharded list and watch&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Client-Side Sharding Isn't Enough
&lt;/h2&gt;

&lt;p&gt;Some controllers already implement horizontal sharding—tools like kube-state-metrics assign each replica a slice of the keyspace and discard irrelevant objects locally. Sounds reasonable until you map the actual costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deserialization waste&lt;/strong&gt;: N replicas each deserialize the full event stream, even though each one will throw away roughly (N-1)/N of it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network scales wrong&lt;/strong&gt;: Bandwidth grows with the number of replicas instead of shrinking with the shard size. Three replicas = three times the API server egress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU efficiency tanks&lt;/strong&gt;: Every CPU cycle spent parsing objects you'll discard is a cycle you could've spent on actual work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem isn't the sharding logic—it's that all filtering happens &lt;em&gt;after&lt;/em&gt; the data leaves the API server. You're paying the full cost upfront, then hoping the controller will do the right math on its end.&lt;/p&gt;
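
&lt;p&gt;The cost is easy to put numbers on. The following Go sketch models the client-side case under a simplifying assumption: every event is useful to exactly one replica. The event count and the function name clientSideCost are illustrative, not measurements.&lt;/p&gt;

```go
package main

import "fmt"

// clientSideCost models N replicas that each receive and deserialize the
// full event stream and then keep only their own share.
// It returns the total objects deserialized and how many were discarded.
func clientSideCost(events, replicas int) (deserialized, discarded int) {
	deserialized = events * replicas
	discarded = deserialized - events // each event is useful to exactly one replica
	return deserialized, discarded
}

func main() {
	// Illustrative numbers only.
	events, replicas := 900000, 3
	total, wasted := clientSideCost(events, replicas)
	fmt.Printf("deserialized=%d discarded=%d (%.0f%% of the work)\n",
		total, wasted, 100*float64(wasted)/float64(total))
}
```

&lt;p&gt;The discarded share is (N-1)/N, so adding replicas makes the waste worse, not better.&lt;/p&gt;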

&lt;h2&gt;
  
  
  Server-Side Sharding Changes the Game
&lt;/h2&gt;

&lt;p&gt;Instead of filtering downstream, Kubernetes v1.36 moves the filter upstream. Controllers now tell the API server exactly which slice of the keyspace they own, and the API server &lt;em&gt;only sends matching events&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The mechanism is deceptively simple: a new &lt;code&gt;shardSelector&lt;/code&gt; field in &lt;code&gt;ListOptions&lt;/code&gt; lets you specify which slice of the hash space a replica owns. When you request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListOptions&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ShardSelector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShardSelector&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c"&gt;// This replica is shard 0&lt;/span&gt;
        &lt;span class="n"&gt;Total&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c"&gt;// Out of 3 total shards&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;clientset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CoreV1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NamespaceAll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API server hashes each object's namespace and name, maps it to a shard range, and filters at the source. Only events matching your shard ever leave the server.&lt;/p&gt;
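
&lt;p&gt;To make the idea concrete, here is a self-contained Go sketch of hash-based shard assignment. It assumes an FNV-1a hash of namespace/name reduced modulo the shard count; the function name shardFor and the exact key format are assumptions for illustration, and the API server's real scheme may differ.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps an object's namespace/name to a shard index from 0 up to
// total-1. FNV-1a and the modulo mapping are assumptions for illustration.
func shardFor(namespace, name string, total uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(namespace + "/" + name)) // Write on a hash.Hash never returns an error
	return h.Sum32() % total
}

func main() {
	const total = 3
	for _, name := range []string{"web-0", "web-1", "web-2", "db-0"} {
		fmt.Printf("default/%s maps to shard %d of %d\n", name, shardFor("default", name, total), total)
	}
}
```

&lt;p&gt;The property that matters is determinism: the same object always lands in the same shard, so exactly one replica owns it for a given shard count.&lt;/p&gt;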

&lt;h2&gt;
  
  
  What Actually Changes for You
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Immediate wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower per-replica CPU&lt;/strong&gt;: No wasted deserialization cycles. Each replica only processes what it owns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced network&lt;/strong&gt;: API server sends 1/N of the traffic per replica. Scale to 10 replicas? You've slashed per-replica egress by 90%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better controller responsiveness&lt;/strong&gt;: Smaller event streams mean faster reconciliation loops and lower latency on watch operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs to know:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is alpha in v1.36, so expect the API surface to evolve. Don't ship it to production yet.&lt;/li&gt;
&lt;li&gt;Your controller code needs to know its shard assignment and pass it on every list/watch call. If you're using a framework like kubebuilder, watch for patches that handle this automatically.&lt;/li&gt;
&lt;li&gt;Shard assignment is deterministic: objects map to shards based on an &lt;code&gt;fnv.New32a&lt;/code&gt; hash of their namespace/name, so two objects landing in the same shard is expected rather than a conflict. The distribution is close to uniform as long as your keyspace is reasonably large.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A Concrete Migration Path
&lt;/h2&gt;

&lt;p&gt;If you maintain a horizontally scaled controller or metrics exporter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check the feature gate&lt;/strong&gt;: &lt;code&gt;ServerSideShardedListAndWatch=true&lt;/code&gt; (alpha).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit your watch/list calls&lt;/strong&gt;: Any place where you're already doing client-side filtering is a candidate for server-side sharding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement shard assignment&lt;/strong&gt;: Use a simple integer (e.g., from a downward API env var or StatefulSet ordinal) to determine your replica's shard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test in a lab cluster first&lt;/strong&gt;: The hash function is deterministic, but edge cases around large resource counts should be validated before production.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
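&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Example: derive shard index from pod ordinal&lt;/span&gt;
&lt;span class="c"&gt;// Needs imports: "log", "os", "strconv"&lt;/span&gt;
&lt;span class="n"&gt;ordinal&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ORDINAL"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// "0", "1", "2", etc.&lt;/span&gt;
&lt;span class="n"&gt;shardIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Atoi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ordinal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Fail fast: a silent fallback to 0 would make two replicas claim shard 0&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"invalid ORDINAL %q: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ordinal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;shardTotal&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt; &lt;span class="c"&gt;// Replicas in your deployment&lt;/span&gt;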

&lt;span class="c"&gt;// Apply to every list/watch&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShardSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;metav1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShardSelector&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shardIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Total&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shardTotal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;Kubernetes clusters are getting bigger, and the pressure on the API server is mounting. Every optimization that pushes filtering logic upstream—whether it's field selectors, label selectors, or now shard selectors—buys you headroom to scale controllers without hitting a resource wall.&lt;/p&gt;

&lt;p&gt;Server-side sharded list and watch is especially important for anyone running high-cardinality watch operations: Pod controllers, node-level agents, cost optimizers, security scanners. For teams operating 1,000+ node clusters with dozens of custom controllers, this can be the difference between stable API server load and constant firefighting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question for Your Cluster
&lt;/h2&gt;

&lt;p&gt;Are your horizontally scaled controllers already doing some form of client-side sharding to stay sane? If so, server-side sharding is probably worth experimenting with in your next lab run. And if you're &lt;em&gt;not&lt;/em&gt; doing sharding yet but you've got multiple replicas of a watcher—you're probably leaving performance on the table.&lt;/p&gt;

&lt;p&gt;What's the largest cluster you're running, and how many custom controllers are watching the same resources? I'd love to hear whether this lands on your v1.36 roadmap.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>platform</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Why AI Sandboxing Needs Kubernetes—And Why You Should Care Now</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Sat, 09 May 2026 08:44:39 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/why-ai-sandboxing-needs-kubernetes-and-why-you-should-care-now-5djk</link>
      <guid>https://dev.to/pratheesh_s/why-ai-sandboxing-needs-kubernetes-and-why-you-should-care-now-5djk</guid>
      <description>&lt;h1&gt;
  
  
  Why AI Sandboxing Needs Kubernetes—And Why You Should Care Now
&lt;/h1&gt;

&lt;p&gt;Last month, Anthropic's Mythos model did something that made security teams everywhere sit up straighter: it autonomously discovered and exploited zero-day vulnerabilities across every major operating system and web browser. We're talking about flaws that survived 27+ years of human scrutiny. One model. One run. Game over.&lt;/p&gt;

&lt;p&gt;If that doesn't immediately make you think about containment, isolation, and security boundaries, it should. And if you're building AI systems—whether you're orchestrating models, deploying inference endpoints, or running autonomous agents—this is the moment to stop treating sandboxing as optional.&lt;/p&gt;

&lt;p&gt;The good news? Kubernetes is becoming the de facto platform for AI sandboxing. And it's not because someone forced it. It's because the problem is so hard that Kubernetes's built-in isolation, resource controls, and multi-tenant abstractions suddenly look essential rather than overengineered.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vulnerability Crisis That Changed Everything
&lt;/h2&gt;

&lt;p&gt;Why does an AI model finding zero-days matter for &lt;em&gt;your&lt;/em&gt; infrastructure? Because AI systems are no longer passive tools you can fence off with a database credential.&lt;/p&gt;

&lt;p&gt;Models like Mythos operate with network access, file system permissions, and sometimes shell execution rights. They can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Probe systems methodically (and faster than humans)&lt;/li&gt;
&lt;li&gt;Chain exploits together autonomously&lt;/li&gt;
&lt;li&gt;Operate 24/7 without fatigue&lt;/li&gt;
&lt;li&gt;Find patterns humans miss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When Mythos found that 27-year-old bug, it exposed a hard truth: &lt;strong&gt;isolation matters more than ever&lt;/strong&gt;. You can't patch your way out of a threat that learns. You can only contain it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Kubernetes Is the Answer (Not the Accident)
&lt;/h2&gt;

&lt;p&gt;Kubernetes wasn't designed for AI workloads. But its architecture maps almost perfectly onto the sandboxing problem:&lt;/p&gt;

&lt;h3&gt;
  
  
  Namespace-Level Isolation
&lt;/h3&gt;

&lt;p&gt;Each Kubernetes namespace becomes a security domain. Your model runs in its own namespace with its own RBAC rules, network policies, and resource quotas. A namespace is not a hard boundary by itself, but with those controls in place a compromised model has a much harder time reaching another application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pod-Level Containment
&lt;/h3&gt;

&lt;p&gt;Pods can enforce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU/memory limits&lt;/strong&gt; — prevent denial-of-service attacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-only root filesystems&lt;/strong&gt; — block persistence mechanisms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security contexts&lt;/strong&gt; — disable privilege escalation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network policies&lt;/strong&gt; — restrict egress to approved endpoints&lt;/li&gt;
&lt;/ul&gt;
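
&lt;p&gt;These settings are easy to check mechanically. The Go sketch below audits a simplified view of a container's settings; the ContainerSettings struct and the Audit function are hypothetical stand-ins for this example, not the Kubernetes API types, and network policies are left out because they live in a separate resource.&lt;/p&gt;

```go
package main

import "fmt"

// ContainerSettings is a simplified, hypothetical view of the fields that
// matter here; it is not the Kubernetes API type.
type ContainerSettings struct {
	Name                     string
	ReadOnlyRootFilesystem   bool
	RunAsNonRoot             bool
	AllowPrivilegeEscalation bool
	HasCPULimit              bool
	HasMemoryLimit           bool
}

// Audit returns one message per missing containment control.
func Audit(c ContainerSettings) []string {
	var issues []string
	if !c.ReadOnlyRootFilesystem {
		issues = append(issues, "root filesystem is writable")
	}
	if !c.RunAsNonRoot {
		issues = append(issues, "container may run as root")
	}
	if c.AllowPrivilegeEscalation {
		issues = append(issues, "privilege escalation is allowed")
	}
	if !c.HasCPULimit || !c.HasMemoryLimit {
		issues = append(issues, "CPU or memory limit is missing")
	}
	return issues
}

func main() {
	c := ContainerSettings{Name: "inference", RunAsNonRoot: true, HasCPULimit: true}
	for _, issue := range Audit(c) {
		fmt.Printf("%s: %s\n", c.Name, issue)
	}
}
```

&lt;p&gt;In a real cluster the same checks are usually enforced at admission time, for example with Pod Security Standards or a policy engine, rather than in application code.&lt;/p&gt;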

&lt;h3&gt;
  
  
  Observability by Default
&lt;/h3&gt;

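&lt;p&gt;Kubernetes gives you container logs, resource metrics, and API audit logging to build on, although audit logging and metrics collection have to be enabled and shipped somewhere useful. Once they are, unexpected model behavior shows up quickly, which is invaluable for detecting autonomous exploitation attempts.&lt;br&gt;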
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-sandbox&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mythos:v1&lt;/span&gt;
    &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;readOnlyRootFilesystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;allowPrivilegeEscalation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ALL&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tmp&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tmp&lt;/span&gt;
    &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# NetworkPolicy is a separate resource, not a field of the Pod spec.&lt;/span&gt;
&lt;span class="c1"&gt;# With no ingress or egress rules listed, it denies all traffic for Pods labelled role: model.&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-sandbox-deny-all&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Egress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Three-Layer Defense
&lt;/h2&gt;

&lt;p&gt;Your AI sandboxing strategy should look like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Cluster Security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RBAC policies that prevent models from listing secrets&lt;/li&gt;
&lt;li&gt;Network policies that restrict outbound traffic to whitelisted APIs&lt;/li&gt;
&lt;li&gt;Persistent volume policies that prevent mount escalation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Container Runtime&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use gVisor or Kata containers for heavier isolation if you're running untrusted models&lt;/li&gt;
&lt;li&gt;Apply a seccomp profile (for example &lt;code&gt;RuntimeDefault&lt;/code&gt;) to restrict which syscalls a model container can make&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Application-Level Governance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate limit model API calls&lt;/li&gt;
&lt;li&gt;Monitor for anomalous system calls&lt;/li&gt;
&lt;li&gt;Implement model behavior verification before production&lt;/li&gt;
&lt;/ul&gt;
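
&lt;p&gt;For Layer 2, a sandboxed runtime is usually wired in through a &lt;code&gt;RuntimeClass&lt;/code&gt;. The sketch below assumes gVisor is already installed on the nodes and registered with the container runtime under the handler name &lt;code&gt;runsc&lt;/code&gt;. The pod name, label, and image are placeholders, not values from a real deployment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Assumption: nodes have gVisor installed with a "runsc" handler configured
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-model                  # placeholder name
  labels:
    role: model
spec:
  runtimeClassName: gvisor               # run this pod inside the gVisor sandbox
  automountServiceAccountToken: false    # no API credentials mounted into the model
  containers:
  - name: model
    image: registry.example.com/models/untrusted:latest  # placeholder image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pods that omit &lt;code&gt;runtimeClassName&lt;/code&gt; keep using the default runtime, so the sandbox can be introduced one workload at a time.&lt;/p&gt;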

&lt;h2&gt;
  
  
  What You Should Do Monday Morning
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit your model deployments.&lt;/strong&gt; Are they running with root? Can they write to the filesystem? Can they reach your internal APIs? If the answer to any of these is "yes," you have a problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set resource limits now.&lt;/strong&gt; Even if you don't implement full sandboxing today, enforcing CPU/memory quotas prevents a runaway model from taking down your cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enable network policies.&lt;/strong&gt; Default-deny egress. Only allow models to reach the services they need. This is the single biggest win for AI security and takes an afternoon to implement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with one namespace per model family.&lt;/strong&gt; This isn't overkill—it's appropriate caution given what we saw with Mythos.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
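
&lt;p&gt;As a starting point for the default-deny step, here is a minimal sketch. It assumes the models live in a namespace called &lt;code&gt;models&lt;/code&gt; (a placeholder) and that your CNI plugin actually enforces &lt;code&gt;NetworkPolicy&lt;/code&gt;. The second policy re-opens DNS so that later allow rules for specific services can resolve names.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Assumption: models run in a namespace named "models" (placeholder)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: models
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
  - Egress               # no egress rules listed, so all outbound traffic is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: models
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Network policies are additive, so each service a model legitimately needs gets its own small allow policy on top of this baseline.&lt;/p&gt;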

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;We've entered an era where running AI systems at scale requires infrastructure thinking that was previously reserved for multi-tenant cloud platforms. Kubernetes isn't just a deployment tool anymore for AI teams. It's your perimeter.&lt;/p&gt;

&lt;p&gt;The models are getting smarter. The vulnerabilities are getting deeper. And the platforms that can isolate and observe AI workloads will be the ones that can sleep at night.&lt;/p&gt;

&lt;p&gt;How are you currently isolating AI workloads in your environment? Are you treating model security as infrastructure security yet, or are you still hoping it doesn't matter?&lt;br&gt;
Ref: &lt;a href="https://www.cncf.io/blog/2026/04/30/ai-sandboxing-is-having-its-kubernetes-moment/"&gt;https://www.cncf.io/blog/2026/04/30/ai-sandboxing-is-having-its-kubernetes-moment/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes 1.36: Breaking Free from Container-Level Resource Constraints</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Thu, 07 May 2026 08:26:42 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/kubernetes-136-breaking-free-from-container-level-resource-constraints-15ll</link>
      <guid>https://dev.to/pratheesh_s/kubernetes-136-breaking-free-from-container-level-resource-constraints-15ll</guid>
      <description>&lt;h2&gt;
  
  
  The Real Cost of One-Size-Fits-All Resource Allocation
&lt;/h2&gt;

&lt;p&gt;You're running a machine learning training job in Kubernetes. Your main container needs exclusive CPU cores, NUMA alignment, and guaranteed memory—every microsecond counts. But your pod also runs three sidecars: a Prometheus exporter using 50m CPU, a log shipper, and a service mesh proxy. Before Kubernetes 1.36, you had two painful options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Allocate exclusive CPUs to every container&lt;/strong&gt;, wasting resources on lightweight sidecars&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give up on Guaranteed QoS class entirely&lt;/strong&gt;, losing the performance guarantees your primary workload depends on&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pod-Level Resource Managers (alpha in 1.36) end this false choice. This is the kind of practical improvement that separates "works in dev" from "scales reliably in production."&lt;/p&gt;

&lt;h2&gt;
  
  
  How Pod-Level Resource Managers Actually Work
&lt;/h2&gt;

&lt;p&gt;The enhancement extends the kubelet's CPU, Memory, and Topology Managers from a strict per-container model to a pod-centric allocation strategy. Enable the &lt;code&gt;PodLevelResourceManagers&lt;/code&gt; and &lt;code&gt;PodLevelResources&lt;/code&gt; feature gates, and you unlock &lt;strong&gt;hybrid resource allocation&lt;/strong&gt;—where your primary container gets exclusive, NUMA-aligned resources while sidecars share the burstable pool.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture Shift
&lt;/h3&gt;

&lt;p&gt;Previously, resource managers evaluated each container independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-training&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus-exporter&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;  &lt;span class="c1"&gt;# Still gets exclusive CPU logic applied&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, with pod-level resources, you specify allocation intent at the pod level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Pod-level budget shared by all containers&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9"&lt;/span&gt;  &lt;span class="c1"&gt;# 8 for ml-training plus headroom for sidecars&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;17Gi"&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;17Gi"&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-training&lt;/span&gt;
    &lt;span class="c1"&gt;# Gets the pod-level resources&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16Gi"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus-exporter&lt;/span&gt;
    &lt;span class="c1"&gt;# Sidecar, uses shared resources&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kubelet now understands: "Allocate these resources to the pod as a unit, optimizing for the main workload, and let sidecars share what's available."&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Impact: Three Scenarios Where This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: High-Frequency Trading&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your order-matching engine needs 16 exclusive cores, minimal latency variance, and strict NUMA binding. Historically, you'd have to waste 2 full cores on a sidecar collecting metrics. Now sidecars can share resources while the primary container gets guaranteed isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Database Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL or RocksDB containers in Kubernetes need predictable page cache behavior. Pod-level allocation lets you assign exclusive CPUs to the database while keeping log collectors and health checkers lightweight and shared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: ML Inference at Scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serving BERT or GPT models requires precise CPU allocation for tokenizer preprocessing and model serving. With pod-level resources, your GPU-powered model container gets exclusive CPU cores while sidecar authentication and request logging run with elastic resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actionable: Testing This Today
&lt;/h2&gt;

&lt;p&gt;If you're running 1.36, here's how to start experimenting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enable the feature gates&lt;/strong&gt; on the kubelet (shown here as kubeadm &lt;code&gt;kubeletExtraArgs&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubeletExtraArgs:
  feature-gates: &lt;span class="nv"&gt;PodLevelResourceManagers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,PodLevelResources&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Set the appropriate manager policies&lt;/strong&gt; (static CPU policy required for NUMA alignment):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubeletExtraArgs:
  cpu-manager-policy: static
  memory-manager-policy: Static
  topology-manager-policy: best-effort
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
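
&lt;p&gt;If you manage the kubelet through a configuration file instead of kubeadm extra args, the same two steps might look like the sketch below. The feature gate names are the ones given above; the reservation values are illustrative assumptions for a single-NUMA-node machine and need adjusting to your hardware.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Assumption: example values for one NUMA node; tune reservations per node type
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  PodLevelResources: true
  PodLevelResourceManagers: true
cpuManagerPolicy: static
topologyManagerPolicy: best-effort
memoryManagerPolicy: Static
# The static CPU policy needs a non-zero CPU reservation
kubeReserved:
  cpu: "1"
  memory: 1Gi
# Static memory policy: reserved memory must equal
# kubeReserved + systemReserved + the hard eviction threshold
# (1Gi + 0 + the default 100Mi = 1124Mi)
reservedMemory:
- numaNode: 0
  limits:
    memory: 1124Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Changing the CPU manager policy on an existing node typically means draining it and removing the old CPU manager state file before restarting the kubelet, so try this on a disposable node first.&lt;/p&gt;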



&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy a test workload&lt;/strong&gt; with explicit pod-level resources and observe kubelet logs for allocation decisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor NUMA efficiency&lt;/strong&gt; using tools like &lt;code&gt;numastat&lt;/code&gt; on nodes—you should see better locality for resource-managed pods.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
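
&lt;p&gt;A test workload for step 3 could be as small as the sketch below. It assumes the pod-level &lt;code&gt;resources&lt;/code&gt; field takes the usual &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;limits&lt;/code&gt; maps, and that a container without its own resources draws from whatever the pod budget leaves over. Treat both as assumptions to verify against the alpha documentation for your exact version. The pod and container names are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: pod-level-resources-test   # placeholder name
spec:
  resources:                       # pod-level budget
    requests:
      cpu: "3"
      memory: 4Gi
    limits:
      cpu: "3"
      memory: 4Gi
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
    resources:                     # integer CPUs with requests equal to limits
      requests:
        cpu: "2"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 2Gi
  - name: sidecar
    image: registry.k8s.io/pause:3.9
    # No container-level resources: expected to share the rest of the pod budget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After the pod starts, compare the CPU set each container is pinned to on the node against what you expected, then vary the container-level values and repeat.&lt;/p&gt;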

&lt;h2&gt;
  
  
  What's Still Alpha
&lt;/h2&gt;

&lt;p&gt;This feature is alpha, so expect changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API shape for &lt;code&gt;.spec.resources&lt;/code&gt; may evolve&lt;/li&gt;
&lt;li&gt;Not all manager combinations are tested (CPU+Memory+Topology interaction especially)&lt;/li&gt;
&lt;li&gt;Upgrade/downgrade paths aren't fully hardened&lt;/li&gt;
&lt;li&gt;Documentation is still being written&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Do NOT use this in production yet&lt;/strong&gt;—but start testing in non-critical clusters now. Alpha feedback shapes the 1.37+ roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Pod-level resource managers reflect a maturation in Kubernetes: moving from "one-size-fits-all per-container abstractions" to "workload-aware resource models." This follows the same evolution we've seen with pod disruption budgets, pod scheduling policies, and topology spread constraints.&lt;/p&gt;

&lt;p&gt;The real win? You no longer have to choose between performance guarantees and resource efficiency. Your ML pipelines, databases, and trading systems can have both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your biggest pain point with resource allocation in Kubernetes today—is it NUMA binding, sidecar overhead, or something else? Share in the comments; early adopter feedback is how features like this get refined.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>platform</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why Two-Thirds of AI Teams Are Betting on Kubernetes (And What That Means for You)</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Mon, 04 May 2026 02:08:30 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/why-two-thirds-of-ai-teams-are-betting-on-kubernetes-and-what-that-means-for-you-3edo</link>
      <guid>https://dev.to/pratheesh_s/why-two-thirds-of-ai-teams-are-betting-on-kubernetes-and-what-that-means-for-you-3edo</guid>
      <description>&lt;p&gt;Kubernetes and AI have become unlikely bedfellows—and the numbers prove it. New data from CNCF and SlashData reveals that two-thirds of organizations running generative AI models have standardized on Kubernetes for orchestration. But here's the thing: &lt;strong&gt;it's not because Kubernetes magically solves AI problems.&lt;/strong&gt; It's because the engineering fundamentals that make Kubernetes valuable—standardization, repeatability, resource isolation—are exactly what AI workloads demand when they move beyond the laptop and into production.&lt;/p&gt;

&lt;p&gt;If you're building or scaling AI systems, this isn't just trivia. It's a signal about where the industry is converging, and whether Kubernetes is right for you depends less on hype and more on what you're actually trying to accomplish.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Story Behind the Numbers
&lt;/h2&gt;

&lt;p&gt;Let's be clear: Kubernetes didn't become the platform of choice for AI because it was purpose-built for LLMs or model inference. It became the default because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization across teams&lt;/strong&gt;: When you have data scientists, ML engineers, and infrastructure teams all shipping models, Kubernetes provides a common deployment target. No more "it works on my machine" fragmentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource orchestration&lt;/strong&gt;: AI workloads are hungry. GPUs, accelerators, memory—Kubernetes abstracts these away and lets you define what each model needs without manual provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy at scale&lt;/strong&gt;: If you're running multiple models for different teams or products, isolation and fair resource allocation become non-negotiable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here's what the data really highlights: &lt;strong&gt;success with AI still comes down to boring, foundational work.&lt;/strong&gt; The teams winning aren't the ones who found the perfect Kubernetes YAML template. They're the ones with solid internal developer platforms (IDPs), clear observability, and a relentless focus on developer experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The IDP Question Every AI Team Needs to Answer
&lt;/h2&gt;

&lt;p&gt;The most important implication from this research is the emphasis on internal developer platforms. Here's why:&lt;/p&gt;

&lt;p&gt;AI teams move fast but often lack the operational maturity of traditional backend teams. They want to experiment, iterate, and ship—quickly. But you can't scale that without abstraction.&lt;/p&gt;

&lt;p&gt;An effective IDP for AI sits between your data scientists (who want to ship models) and Kubernetes (which handles the orchestration). It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-service model deployment&lt;/strong&gt;: Data scientists submit a model; the platform handles GPU allocation, versioning, and rollback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized observability&lt;/strong&gt;: Metrics, logs, and traces for inference endpoints—not just for ops, but for the ML team to catch drift and degradation early.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost visibility&lt;/strong&gt;: AI is expensive. Your IDP should show teams exactly what their models cost to run.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: A simplified model deployment abstraction&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml.company.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelEndpoint&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-classifier-prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/company/gpt-classifier:v2.1.3&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia.com/gpu:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
  &lt;span class="na"&gt;autoscaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;targetUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This layer matters &lt;em&gt;more&lt;/em&gt; than Kubernetes itself. Kubernetes is just the underlying engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Takeaway: Do You Actually Need Kubernetes for AI?
&lt;/h2&gt;

&lt;p&gt;Honest answer: &lt;strong&gt;probably, eventually.&lt;/strong&gt; But not on day one.&lt;/p&gt;

&lt;p&gt;If you're:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a single model for inference with predictable load → managed services (Vertex AI, SageMaker, Modal) might be faster to market.&lt;/li&gt;
&lt;li&gt;Experimenting with models in notebooks → local containers and lightweight orchestration are enough.&lt;/li&gt;
&lt;li&gt;Running multiple models, multiple teams, with variable workloads and cost constraints → Kubernetes becomes the logical choice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trap is assuming Kubernetes is the goal. It's not. &lt;strong&gt;The goal is reliable, scalable, observable AI systems that developers actually enjoy maintaining.&lt;/strong&gt; Kubernetes is often the best tool for that—but it requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Strong foundations first&lt;/strong&gt;: GitOps, infrastructure-as-code, observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An IDP on top&lt;/strong&gt;: Don't expose Kubernetes complexity to data scientists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear resource governance&lt;/strong&gt;: AI compute is expensive; track it ruthlessly.&lt;/li&gt;
&lt;/ol&gt;
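
&lt;p&gt;For the resource governance point, a namespace-scoped &lt;code&gt;ResourceQuota&lt;/code&gt; is one simple way to cap accelerator spend per team. This sketch assumes one namespace per team and the NVIDIA device plugin's &lt;code&gt;nvidia.com/gpu&lt;/code&gt; resource name; the namespace and the numbers are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Assumption: one namespace per team; "team-nlp" and all values are placeholders
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-budget
  namespace: team-nlp
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # at most 4 GPUs requested across the namespace
    requests.memory: 128Gi
    limits.memory: 128Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once a quota covers memory limits, pods in that namespace must declare them or be rejected, which is a useful forcing function for teams that skip resource settings.&lt;/p&gt;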

&lt;h2&gt;
  
  
  What's Missing from the Narrative
&lt;/h2&gt;

&lt;p&gt;One thing the data doesn't capture: the operational overhead. Two-thirds of teams use Kubernetes for AI, but we don't know how many are struggling with it. How many are maintaining custom YAML hell? How many have visibility into whether their GPU allocation actually makes sense?&lt;/p&gt;

&lt;p&gt;The fact that two-thirds converge on Kubernetes is less about it being perfect and more about it being the least bad option at scale. That's important context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI doesn't demand Kubernetes. What makes Kubernetes pay off for AI is a team with mature engineering practices and the discipline to build abstractions on top of it.&lt;/p&gt;

&lt;p&gt;If you're starting an AI project, ask yourself: Do we have the fundamentals in place? Do we have an IDP or the plan to build one? If the answer is "not yet," Kubernetes can wait. If you're already managing multiple models across teams, you're probably not far from needing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your experience?&lt;/strong&gt; Are you running AI workloads on Kubernetes? What would have made the journey smoother—and what would you do differently next time?&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>platform</category>
      <category>devops</category>
    </item>
    <item>
      <title>How Cloudflare Built Resilience: Lessons from Their Infrastructure Overhaul</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Sun, 03 May 2026 04:06:18 +0000</pubDate>
      <link>https://dev.to/pratheesh_s/how-cloudflare-built-resilience-lessons-from-their-infrastructure-overhaul-4oef</link>
      <guid>https://dev.to/pratheesh_s/how-cloudflare-built-resilience-lessons-from-their-infrastructure-overhaul-4oef</guid>
      <description>&lt;h1&gt;
  
  
  How Cloudflare Built Resilience: Lessons from Their Infrastructure Overhaul
&lt;/h1&gt;

&lt;p&gt;When a single misconfiguration can cascade across a global CDN and take down customer traffic, every deployment becomes a high-stakes decision. Cloudflare recently completed a massive push to make their infrastructure fundamentally more resilient—and their approach offers critical lessons for anyone operating at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Risk Concentrates in Configuration
&lt;/h2&gt;

&lt;p&gt;Most infrastructure incidents don't happen because of hardware failures or clever attacks. They happen because someone pushed a configuration change, the change propagated faster than expected, and there was no circuit breaker in between.&lt;/p&gt;

&lt;p&gt;Cloudflare's situation was familiar to anyone running global-scale systems: their engineering teams were shipping improvements constantly, but each deployment carried latent risk. A small mistake in a configuration file could reach millions of users before detection. The traditional guardrails—code review, staging tests, gradual rollouts—weren't enough to catch every edge case.&lt;/p&gt;

&lt;p&gt;This is why they launched "Fail Small," an engineering initiative focused on preventing large-scale incidents by making small failures impossible to propagate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two-Tool Foundation: Snapstone and Engineering Codex
&lt;/h2&gt;

&lt;p&gt;The solution wasn't a single tool. Instead, Cloudflare invested in two complementary systems:&lt;/p&gt;

&lt;h3&gt;
  
  
  Snapstone: Safer Configuration Changes
&lt;/h3&gt;

&lt;p&gt;Snapstone is a configuration validation and deployment framework that treats configuration changes with the same rigor as code deployments. Here's what makes it different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-flight validation&lt;/strong&gt;: Changes are tested against historical traffic patterns and failure scenarios before rollout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staged rollout control&lt;/strong&gt;: Configuration doesn't flip globally—it rolls out in waves with automated rollback if anomalies appear&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change hygiene&lt;/strong&gt;: Every configuration change is tagged with context: who changed it, why, what it affects, and what the rollback plan is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as infrastructure-as-code discipline applied to runtime configuration. The payoff is measurable: configuration-related incidents drop significantly because bad changes simply don't reach production simultaneously across all regions.&lt;/p&gt;
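
&lt;p&gt;Snapstone itself is internal to Cloudflare, so the following is not its configuration format. It is only a sketch of the same wave-based idea applied to a workload, using an Argo Rollouts canary strategy and assuming Argo Rollouts is installed in the cluster. Names and the image are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Not Snapstone: a generic staged rollout using Argo Rollouts (assumed installed)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: edge-proxy                 # placeholder name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: edge-proxy
  template:
    metadata:
      labels:
        app: edge-proxy
    spec:
      containers:
      - name: edge-proxy
        image: registry.example.com/edge-proxy:v2   # placeholder image
  strategy:
    canary:
      steps:
      - setWeight: 10              # first wave: 10% of traffic
      - pause: {duration: 10m}     # watch metrics before continuing
      - setWeight: 50
      - pause: {}                  # wait for a manual promotion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point of the pauses is that a bad change stops at the first wave instead of reaching everything at once.&lt;/p&gt;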

&lt;h3&gt;
  
  
  Engineering Codex: Embedding Best Practices
&lt;/h3&gt;

&lt;p&gt;Tools alone don't prevent incidents—culture does. The Engineering Codex is Cloudflare's answer: a formalized knowledge base of "how we safely operate infrastructure" that's embedded into workflows.&lt;/p&gt;

&lt;p&gt;When engineers write configuration or deploy services, they're nudged toward patterns that have been proven safe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment templates that encode retry logic and timeout handling&lt;/li&gt;
&lt;li&gt;Configuration examples that highlight common failure modes&lt;/li&gt;
&lt;li&gt;Runbooks that appear automatically when certain alerts fire&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not gatekeeping. It's scaffolding. New engineers learn the "right way" by default, and experienced engineers can deviate with confidence because they understand the underlying principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Beyond Cloudflare
&lt;/h2&gt;

&lt;p&gt;You might think: "Sure, this makes sense for a global CDN. But we're running a smaller operation." That's exactly backward.&lt;/p&gt;

&lt;p&gt;Cloudflare's insight applies &lt;em&gt;especially&lt;/em&gt; to smaller teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your blast radius is fixed regardless of team size&lt;/strong&gt;. A misconfigured load balancer breaks things just as hard at a 50-person startup as at Cloudflare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You have fewer engineers to catch mistakes&lt;/strong&gt;. Automation and frameworks matter more when you don't have five people reviewing every change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incidents are more expensive relative to revenue&lt;/strong&gt;. Relative to revenue, a 2-hour outage costs a small startup more than it costs a large company.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Fail Small philosophy: &lt;em&gt;Make the safe path the default path.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Actionable Takeaway: Start With Configuration as Code
&lt;/h2&gt;

&lt;p&gt;If you take one thing from Cloudflare's approach, it's this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat configuration changes with the same discipline as code deployments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audit your current configuration management. Is it in version control? Are changes tested before rollout? Is there a rollback procedure?&lt;/li&gt;
&lt;li&gt;Identify your highest-risk configuration files (anything that affects traffic routing, authentication, or resource limits).&lt;/li&gt;
&lt;li&gt;Implement one simple control: all changes to critical configuration must be reviewed and tested in staging before production rollout.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need to build Snapstone from scratch. Tools like Terraform, ArgoCD, or even careful GitOps practices get you 80% of the way there.&lt;/p&gt;
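
&lt;p&gt;As one possible shape for the GitOps route, the sketch below declares an Argo CD &lt;code&gt;Application&lt;/code&gt; that keeps a staging namespace in sync with a Git path. The repository URL, path, and namespaces are hypothetical; the assumption is that Argo CD runs in the &lt;code&gt;argocd&lt;/code&gt; namespace of the same cluster.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Assumption: Argo CD is installed in the "argocd" namespace
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: edge-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/edge-config.git  # placeholder repo
    targetRevision: main
    path: overlays/staging         # staging first; production gets its own Application
  destination:
    server: https://kubernetes.default.svc
    namespace: edge
  syncPolicy:
    automated:
      prune: true                  # remove resources deleted from Git
      selfHeal: true               # revert manual drift back to what Git says
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this in place, every configuration change arrives as a reviewed commit, and rollback is a &lt;code&gt;git revert&lt;/code&gt;.&lt;/p&gt;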

&lt;h2&gt;
  
  
  The Bigger Picture: Resilience is Systematic
&lt;/h2&gt;

&lt;p&gt;Cloudflare's Fail Small initiative reminds us that infrastructure resilience isn't about heroic incident response. It's about making bad outcomes progressively harder to achieve.&lt;/p&gt;

&lt;p&gt;Each control they added—validation, staged rollouts, embedded best practices—removes one more degree of freedom from the "I broke production" state space.&lt;/p&gt;

&lt;p&gt;What's one configuration change that could take down your service right now? How many approval gates stand between someone and deploying it? That's where to start.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your team's biggest source of configuration-related incidents? Have you invested in preventing them, or mostly in recovering from them? Drop your thoughts below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>cicd</category>
      <category>platform</category>
    </item>
  </channel>
</rss>
