DEV Community: Tigera Inc

NVIDIA OpenShell Secures the Agent. Who Governs the Fleet?

Alister Baroi — Wed, 15 Jul 2026 14:41:56 +0000

Most attempts to control AI agents work at the model layer (alignment, system prompts) or the application layer (guardrail libraries, output filters). Both share a flaw: the thing being secured is also the thing doing the securing. A sufficiently confused or sufficiently compromised agent can talk its way past its own instructions.

OpenShell takes a different position, and it is the right one. Put the controls in the environment, where the agent cannot negotiate with them. An agent inside an OpenShell sandbox cannot leak a credential it never received, and cannot call an endpoint the kernel refuses to route.

If that argument sounds familiar, it should. It is the same case we made in Why We Built Lynx and throughout the AI agent accountability series: controls the agent can override are not controls. NVIDIA arriving at the same conclusion, with an Apache 2.0 project and a partner list that includes Cisco, CrowdStrike, Google Cloud, and Microsoft Security, is the strongest endorsement the environment-layer approach has had yet.

So this is not a “versus” post. OpenShell and Lynx solve different halves of the same problem, and NVIDIA said so first: its own launch announcement says securing autonomous systems “requires an integrated ecosystem”.

What OpenShell actually does

OpenShell is a secure runtime for a single agent on a single machine. You install it with one command, then launch an agent inside a sandbox:

openshell sandbox create -- claude

That agent (Claude Code, Codex, Cursor, OpenCode, or your own container image) now runs inside an isolated environment governed by a declarative YAML policy with four layers:

Filesystem: Which paths the agent can read or write, enforced with Landlock and locked at sandbox creation.
Process: Which binaries can execute and which syscalls are available, enforced with seccomp. An agent can install a verified skill but cannot run an unreviewed binary.
Network: deny-by-default egress, intercepted at the HTTP method and path level, hot-reloadable as approvals are granted.
Inference: A “Privacy Router” that decides which LLM backend serves each call, keeping sensitive context on local models and routing to frontier models only when policy allows. Credentials are swapped at the router, so the real API key never sits inside the sandbox.

The threat model is specific and well chosen: long-running, self-evolving agents with shell access, live credentials, and the ability to rewrite their own code. Prompt injection, malicious third-party skills, subagents inheriting permissions they should not have. When the agent hits a policy wall, it can propose a policy change and a human approves or rejects it. Autonomy with a human holding the pen.

It is currently alpha (“proof of life,” in NVIDIA’s words), runs on macOS, Windows via WSL 2, and Linux, and targets everything from a developer laptop to DGX-class machines.

Where OpenShell stops, on purpose

Here is the part that matters for anyone running agents in production. NVIDIA’s technical documentation is explicit about what OpenShell does not address:

Agent-to-agent communication governance
Agent identity and authentication
Cross-sandbox communication patterns

Kubernetes is the near-miss on that list. OpenShell does run on Kubernetes: an experimental Helm chart, marked not for production, that provisions sandbox pods on a cluster. But putting sandboxes on Kubernetes and governing a fleet across Kubernetes are different jobs. Each sandbox still enforces its own YAML in isolation, with no shared agent identity and no view of its neighbors.

Read that list again. It is not a gap NVIDIA missed; it is a boundary they drew deliberately, and they drew it exactly where the fleet begins. OpenShell answers “what can this agent do on this box?” It does not attempt to answer “which of my two hundred agents called the payments MCP server last Tuesday, under whose authority, and using which model?” And as we argued in The AI Agent Accountability Gap, network policies, API gateways, and RBAC cannot answer those questions either.

They are the questions Lynx exists for. Side by side:

Concern	OpenShell	Lynx
Scope	One agent, one sandbox	A fleet of agents across a cluster
Agent identity & authentication	No first-class agent identity (users and components authenticate; agents just get injected credentials)	SPIFFE/SPIRE workload identity, mTLS, per-agent JWTs
Policy	YAML per sandbox	Cedar policy across agents, MCP servers, and LLM providers
A2A and MCP traffic	Out of scope	Gateway proxy, every request authorized individually
Agents you didn’t launch	Not applicable	eBPF detection classifies them as sanctioned, shadow, or unknown
Audit	Local allow/deny logs per sandbox	Fleet-wide Agent Trail, including which model actually served each call

One box, two hundred boxes. Same philosophy, different altitude.

Closing the gap: three integration patterns

None of these require code changes in either product. They use configuration surfaces both systems document today: OpenShell’s deny-by-default egress policy and credential injection on one side, Lynx’s gateway, registry, and token service on the other. To be clear about what this is: a proposed reference architecture drawn from published documentation, not a tested walkthrough. OpenShell is weeks old and still alpha. But the seams line up well enough that I think the patterns are worth writing down now.

Pattern 1: One road out of the sandbox

OpenShell intercepts all outbound traffic and denies by default. So write the narrowest useful network policy: the only egress a sandbox is allowed is the Lynx Agent Gateway.

Every MCP call, every A2A request, every LLM call now has exactly one path, and that path runs through Cedar authorization on a per-request basis, with the decision recorded in Agent Trail. The division of labor is clean. OpenShell guarantees the agent cannot go around the gateway, even if it is compromised and actively trying. Lynx decides what is allowed through the gateway, and remembers what happened.

Neither system can do the other’s job here. Lynx cannot stop a process inside someone’s laptop sandbox from opening a raw connection; OpenShell can. OpenShell has no idea whether this agent should be allowed to call that MCP tool with those arguments, but Lynx does.

Pattern 2: The API key never enters the sandbox

OpenShell’s Privacy Router already routes inference calls through controlled backends and swaps credentials on the way out. Lynx, as of the current release cycle, treats LLM providers as first-class governed entities: registered in the registry, subject to Cedar policy, visible on the access map, recorded in Agent Trail down to the model that actually served the request.

Chain them. Local-model traffic stays on the box, served by Nemotron or whatever the Privacy Router prefers. Frontier-model traffic routes to the Lynx LLM gateway, where Cedar decides which agent may use which provider and which model, and the credential is attached centrally.

Follow the key. The OpenAI or Anthropic API key exists in exactly one place, inside Lynx. Not in the sandbox, not in the agent’s environment variables, not in a dotfile the agent can read and exfiltrate. And every frontier call, from every sandbox on every developer machine, lands in one audit trail with the caller’s identity and the served model attached. A prompt-injected agent can ask for the key all it wants; there is nothing on the box to steal.

Pattern 3: Identity from birth

OpenShell deliberately focuses on securely running agents rather than defining who those agents are. It provides sandboxing, credential management, and integration with existing identity systems, but it doesn’t maintain a persistent registry of agent identities or establish a trust model between agents. Lynx complements that layer by giving every agent a verifiable identity from the moment it is created.

The integration is intentionally lightweight: a wrapper around openshell sandbox create registers the new agent with the Lynx registry and associates it with an existing workload identity; whether SPIFFE, OIDC, or another supported mechanism. From its first network request, the sandbox represents a known, authenticated agent rather than an anonymous process.

This pattern is what makes the first two enforceable per agent instead of per box, and it has a side effect worth naming. A developer’s local experiment, sandboxed with OpenShell and registered with Lynx, shows up on your access map as a sanctioned agent. The same experiment without registration is exactly the shadow agent that Lynx’s eBPF detection was built to catch. Registration at sandbox creation makes the sanctioned path the lazy path, which is the only kind of security policy developers reliably follow.

Same policy idea, from laptop to cluster

There is a deeper symmetry underneath these patterns. OpenShell’s filesystem and process layers do at sandbox scope roughly what Lynx’s agent-detector does at node scope with eBPF; its network and inference layers do locally what the Lynx gateway does for the fleet with Cedar. (Peter Kelly covered the gateway-plus-kernel enforcement model in Multi-Layer Policy for Securing AI Agents.) Nobody has built a translator between OpenShell YAML and Cedar yet. But the layers correspond closely enough that policy parity across the laptop-to-cluster boundary looks like an engineering problem, not a research problem. An agent developed under a given OpenShell policy could be promoted to Kubernetes with the same intent expressed as Cedar plus a quarantine baseline. That is the roadmap conversation this post is meant to start.

Two smaller threads point the same direction. OpenShell’s Kubernetes chart means sandboxes can run on a Lynx-governed cluster, sitting inside two independent kernel enforcement planes, one inside the sandbox and one on the node, so even a sandbox escape lands in Lynx’s detection perimeter. And OpenShell logs every allow/deny decision locally; forwarding those over OTLP into Agent Trail would put runtime decisions and traffic decisions in a single timeline. Both are speculative today. Neither is far-fetched.

The other half

OpenShell is the most credible answer yet to a question we have been asking all year: how do you give an agent real autonomy without handing it the keys to the host? If you are running coding agents locally, try it; the install is two commands and the defaults are sensible.

Then ask the question NVIDIA deliberately left open. When that agent, and the forty like it across your organization, start talking to MCP servers, to each other, and to three different LLM providers, who is checking identity at the door? Whose policy decides, and where is the record? (Five Principles of an Accountable AI Agent Network is the checklist for evaluating whatever answers you get.)

OpenShell holds the agent. Lynx governs the fleet. The seam between them is thinner than you would expect, and the patterns above are how we would stitch it.

_Lynx is Tigera’s security and governance platform for AI agents on Kubernetes: identity, policy, detection, and audit for every agent in your cluster. Read How Lynx Works or schedule a demo at tigera.io/demo/. _

Ready to see Lynx in action? Schedule a demo.

The post NVIDIA OpenShell Secures the Agent. Who Governs the Fleet? appeared first on Tigera – Creator of Calico.

Tiered Network Policy: Scaling Kubernetes Security

Alister Baroi — Fri, 10 Jul 2026 16:12:22 +0000

As Kubernetes clusters scale from a few development sandboxes to massive, multi-tenant production environments, platform teams often find themselves facing a configuration management crisis. A small number of microservices suddenly demand hundreds of individual Kubernetes NetworkPolicy objects. Managing them becomes operationally expensive, auditing them is difficult, and a single developer misconfiguration can easily drop critical production traffic or open a massive security hole.

To scale cluster security without slowing down engineering velocity, we must abandon the flat, uncoordinated rule planes of the past. The solution lies in establishing a clear, multi-layered framework: a hierarchy of trust powered by tiered network policies.

The Core Problem with Standard Kubernetes NetworkPolicy

Standard Kubernetes NetworkPolicy resources are genuinely useful for basic application microsegmentation, but they have major architectural and organizational bottlenecks when scaled across an enterprise:

Namespace-Scoped by Design: Standard network policies are inherently scoped to a namespace. If your security team mandates a cluster-wide rule, such as blocking all internal pods from querying the cloud provider’s metadata API (169.254.169.254), you have to copy-paste that policy into every single namespace. If a developer creates a new namespace, that guardrail doesn’t exist until someone manually applies it.
Organizational Friction: Because anyone with namespace access can manipulate these policies, it creates a persona gap within organizations. Platform & Security teams need to enforce global, un-overrideable guardrails (e.g. “Isolate the payments namespace from everything else”). DevOps teams need the freedom to write granular, service-to-service rules for their applications without opening infrastructure support tickets.
No Rules Hierarchy: Kubernetes network policies are strictly additive. There are no weights, priorities, or order sequences. An application developer can accidentally (or intentionally) write a loose policy that bypasses the security team’s intended restrictions, undermining any baseline trust.
The “Allow-Only” Restriction: Standard policies cannot explicitly Deny traffic. They operate solely on an allow-list model. Isolation is implicit: if a pod is selected by a policy, any traffic not explicitly allow-listed is dropped. This makes it impossible to write a simple, top-level rule that says, “Block traffic from Namespace X to Namespace Y, no matter what.”

What a Scalable Solution Requires

To solve these scaling pain points, we have to move away from a flat network architecture and adopt a Tiered Policy Model. A scalable solution requires four core capabilities:

Global, Cluster-Wide Scope: To stop copy-pasting rules, administrators need a policy type that natively operates at the cluster level rather than the namespace level. This allows a single manifest to apply to all current and future namespaces automatically, eliminating the risk of “configuration drift” and ensuring day-one protection for new workloads.
Separation of Concerns (RBAC-Gated Tiers): Security, platform, and application teams need their own distinct logical “zones” or tiers to deploy rules. These tiers must be strictly gated by Role-Based Access Control (RBAC) so a developer modifying their application namespace cannot alter or override a higher-priority platform or security tier.
Deterministic, Top-Down Evaluation: The firewall engine must evaluate these tiers sequentially. Traffic must pass through the highest-priority tier (e.g., Security) before it ever reaches a lower tier (e.g., Application).
Explicit Deny and Pass Actions: Standard policies are allow-only, so they can never express a hard “block this, period.” A tiered model needs explicit actions: a Deny that states a prohibition outright, and a third option, Pass, that lets one tier defer the decision to the next rather than ending it (covered in detail below).

Why the Pass Action Matters

The key enabler of tiered policies is the Pass action. Think of Pass as a delegated hand-off. When a packet matches a rule with a Pass action in a high-priority tier, the engine skips the remaining lower-precedence rules in that tier and continues evaluation in the next tier down the hierarchy. This allows security administrators to say: “This traffic is safe by our standards, but we aren’t explicitly endorsing it. We are passing the final decision down to the platform or development teams to handle at their layer.” Without a Pass action, tiered policies become brittle, forcing admins to explicitly track and approve every single microservice connection at the highest level, which would defeat the purpose of developer agility.

The Kubernetes Native Answer: ClusterNetworkPolicy

Recognizing these scalability constraints, the SIG-Network Policy API group developed a native, multi-layered solution: ClusterNetworkPolicy. The API delivers exactly the four capabilities outlined above, with a few concrete specifics worth calling out:

A Native Three-Layer Hierarchy: It introduces distinct, sequentially evaluated resource tiers. ClusterNetworkPolicy (Admin tier) at the top for absolute guardrails, standard NetworkPolicy in the middle for developer agility, and ClusterNetworkPolicy (Baseline tier) at the bottom as a cluster-wide fallback safety net. Unlike namespace-jailed standard policies, the Admin and Baseline tiers apply across the entire cluster.
Separation of Concerns: Because ClusterNetworkPolicy is delivered as a new Custom Resource Definition (CRD) rather than a tweak to the existing NetworkPolicy type, standard Kubernetes RBAC governs who can interact with it. This ensures that only Security/Platform teams access ClusterNetworkPolicy resources, while DevOps teams work only with namespaced network policies.
Numeric Precedence: Policies feature explicit integer priorities. A policy with a lower integer value (e.g., 10) takes precedence over a policy with a higher value (e.g., 100), allowing for deterministic evaluation.
Explicit Actions: Rules are no longer purely additive. You can now design rules with explicit Accept, Deny, and Pass actions.

This API completely shifts how cluster administrators manage traffic by introducing a native, three-tiered evaluation hierarchy:

The Top Layer: ClusterNetworkPolicy (Admin tier): This is the high-priority tier controlled by cluster and security administrators. Rules here are evaluated first, and two of its three actions are terminal: an Accept or a Deny is a final verdict that bypasses the developer’s NetworkPolicy layer entirely. A Deny here cannot be overridden by any developer manifest, but the same is true of Accept: if an admin explicitly accepts traffic, it is permitted regardless of what a developer policy would have decided. This is the crucial difference from a standard NetworkPolicy allow, which is additive. An Admin-tier Accept is an override, not a contribution. Only the third action, Pass, is non-terminal: it declines to decide and hands evaluation down to the next tier.

As an example, the following ClusterNetworkPolicy can be used to allow DNS UDP traffic toward kube-dns from all namespaces:

apiVersion: policy.networking.k8s.io/v1alpha2
kind: ClusterNetworkPolicy
metadata:
  name: allow-dns-to-kube-dns
spec:
  tier: Admin
  priority: 100
  subject:
    namespaces:
      matchLabels: {}
  egress:
    - name: allow-dns
      action: Accept
      to:
        - pods:
            namespaceSelector:
              matchLabels:
                kubernetes.io/metadata.name: kube-system
            podSelector:
              matchLabels:
                k8s-app: kube-dns
      protocols:
        - udp:
            destinationPort:
              number: 53

The Middle Layer: Standard NetworkPolicy: This is the traditional application-developer tier. It only kicks in if traffic wasn’t explicitly allowed or denied by the ClusterNetworkPolicy in the Admin tier above it. This keeps developers agile, letting them connect their microservices without needing admin intervention. One subtlety to keep in mind: standard NetworkPolicy carries an implicit deny for any pod it selects. So traffic only falls through to the Baseline tier when no NetworkPolicy selects the workload at all. A pod that is selected but matches none of its Accept rules is already dropped here, and never reaches the Baseline tier below. The following network policy can be used to permit ingress HTTP traffic for the awesome-app namespace.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-http-ingress
  namespace: awesome-app
spec:
  podSelector:
    matchLabels:
      app: http-server
  policyTypes:
    - Ingress
  ingress:
    - ports:
       - protocol: TCP
         port: 80

The Bottom Layer: ClusterNetworkPolicy (Baseline tier): This is the cluster-scoped Baseline tier, meant for default fallbacks. It acts as the safety net after developer policies are checked. For example, if a developer forgets to secure their pod, this policy can enforce a default cluster-wide posture like “if no developer policy matches this traffic, deny all intra-cluster traffic by default.”. The following ClusterNetworkPolicy would satisfy this requirement:

apiVersion: policy.networking.k8s.io/v1alpha2
kind: ClusterNetworkPolicy
metadata:
  name: deny-all
spec:
  tier: Baseline
  priority: 1
  subject:
    namespaces:
      matchLabels: {}
  ingress:
  - name: deny-all-ingress
    action: Deny
    from:
    - namespaces:
        matchLabels: {}

Combined, these features provide a native, multi-level strategy for scaling enterprise cluster security far beyond the limitations of a flat configuration.

Extending the Model: Calico Tiers

While the native Kubernetes APIs introduce a better three-layer model, and some control over rule priority, enterprise environments often require finer granularity. Calico expands on this concept by offering unlimited policy tiers, allowing you to design an arbitrary number of custom evaluation layers. Calico tiers will be discussed in the next post.

Get started with an interactive demo: DNS Policy with Calico

The post Tiered Network Policy: Scaling Kubernetes Security appeared first on Tigera – Creator of Calico.

Save the Address, Save the Cloud: A Hands-on KubeVirt Live Migration Workshop

Alister Baroi — Thu, 09 Jul 2026 13:58:26 +0000

In the previous post in this series, we covered why Virtual Machine (VM) Live Migration in Kubernetes is difficult: a VM’s IP is its identity, and the “new” VM on the destination node has to come up with the same IP, this something that Kubernetes is not known for, and on top of that, traffic has to switch over only after network security policies are in place. Calico v3.32.0 delivers all the above and allows you to Live Migrate a VM without any network disruptions and this post is a short, do-it-yourself workshop to achieve it.

In about 5 minutes you’ll bring up a 3-node cluster, install Calico + KubeVirt, run a VM, and migrate it live.

Requirements

A Linux or a Windows Machine preferably WSL2 ( Mac Is not supported by KubeVirt )
Docker or Podman with at least 8 GB RAM
kubectl
KIND (v0.31.0)
virtctl (v1.8.2)

Note: In many Linux distros the default for most kernel parameters are too low, for a kind cluster running KubeVirt. Use the following command to temporarily increase these limits.

sudo sysctl -w fs.inotify.max_user_instances=2048
sudo sysctl -w fs.inotify.max_user_watches=1048576

If you face any challenges during the KubeVirt live migration, make sure to drop by our Slack to ask your questions.

Create a multi-node cluster

By default KIND is shipped with a simple default CNI, use the following command to disable the default CNI and create the demo cluster:

kind create cluster --config -<<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: calico-lab
nodes:
  - role: control-plane
  - role: worker
  - role: worker
networking:
  disableDefaultCNI: true
  podSubnet: 192.168.0.0/16
EOF

Install Calico

Live local VM migration is part of Calico v3.32.0 release and it’s important that you install or upgrade to this specific version. If you are already running Calico Unified Platform in your environment skip this part and go directly to the “Version and feature verifications” step there you can check your version of Calico.

Use the following command to install Tigera Operator:

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.32.0/manifests/tigera-operator.yaml

Wait for the rollout to complete:

kubectl -n tigera-operator rollout status deploy/tigera-operator --timeout=2m

Next, create the installation resource:

kubectl create -f - <<'EOF'
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  kubeletVolumePluginPath: None
  calicoNetwork:
    bgp: Enabled
    ipPools:
    - blockSize: 26
      cidr: 192.168.0.0/16
      encapsulation: IPIP
      natOutgoing: Enabled
      nodeSelector: all()
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
EOF

Wait for Calico installation to finish, you can verify that by running the following command:

kubectl wait --for=condition=Available tigerastatus/calico --timeout=2m

Install KubeVirt

To extend Kubernetes to manage stateful virtual machines just like ordinary containers, you first need to install KubeVirt, which acts as the crucial abstraction layer between your cluster and the underlying QEMU emulator.

Use the following command to install KubeVirt

kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/v1.8.2/kubevirt-operator.yaml
kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/v1.8.2/kubevirt-cr.yaml

Use the following command

kubectl -n kubevirt rollout status deploy/virt-operator --timeout=5m

Preparing KubeVirt

To prepare the cluster for live migration, we must first configure KubeVirt to enable bridge networking on the pod network. This is the only networking mode that allows Calico to successfully persist a VM’s IP address across nodes. The permitBridgeInterfaceOnPodNetwork flag is a cluster-wide configuration in KubeVirt that determines whether a Virtual Machine (VM) can utilize the bridge interface type for its default pod network. While this is often set to true by default, cluster administrators sometimes disable it (set it to false) for security or architectural reasons.

kubectl -n kubevirt patch kubevirt kubevirt --type=merge -p '{
  "spec": {"configuration": {
    "developerConfiguration": {"useEmulation": true,"clusterProfiler": true},
    "network": {"defaultNetworkInterface": "bridge", "permitBridgeInterfaceOnPodNetwork": true}
  }}}'

After configuration is in place KubeVirt will spawn handler and API pods, this can take some time depending on your machine.

Use the following command to make sure KubeVirt deployment is complete:

kubectl -n kubevirt wait --for=condition=Available kubevirt/kubevirt --timeout=10m

Create a VM

Two things make this VM migratable: bridge: {} networking, and the allow-pod-bridge-network-live-migration annotation (KubeVirt blocks bridge-mode migration without it).

Use the following command to create a VM:

kubectl create -f https://raw.githubusercontent.com/frozenprocess/kubevirt-migration-observer/main/examples/vm.yaml

Live VM Migration

Live VM migration is a marathon relay, there are multiple KubeVirt and Calico components that work together in order to make this migration happen and the beauty of this integration is that all the complexity is hidden behind a single command virtctl migrate. While Calico Unified Platform is heavily involved in the security and networking side of a VM migration process, KubeVirt handles the compute lifecycle, specifically racing the guest’s memory across the wire and cutting the CPU over to the new node.

To better understand this dance let’s use the KubeVirt observer app, this app will gather all the information regarding your cluster during the migration and organize it in a searchable way.

Use the following command to deploy the observer app inside the cluster:

kubectl create -f https://raw.githubusercontent.com/frozenprocess/kubevirt-migration-observer/main/examples/observer-job.yaml

After observer is running

virtctl migrate demo-vm

Gathering The Report

To make sure that the report is generated use the following command to take a peak at the observer status:

kubectl logs -l job-name=kubevirt-migration-observer

The expected result should be the following:

report written:

report written:
  markdown: /work/reports/demo-vm-20260604T232645Z.md
  json : /work/reports/demo-vm-20260604T232645Z.json
  html : /work/reports/demo-vm-20260604T232645Z.html
  audit : /work/reports/demo-vm-20260604T232645Z-audit/audit.md
[observer] report written to /work/reports; holding 3600s for kubectl cp

Use the following command to copy the report to your workstation:

pod=$(kubectl -n default get pod -l app=kubevirt-migration-observer -o name)
kubectl -n default cp "${pod#pod/}:/work/reports" ./reports

Now head over to the reports folder on your local machine where you executed the command and examine the report.

Note: observer app also has the ability to collect performance logs, and flamegraphs. If you are interested in running a full VM migration profile checkout the full tutorial here.

The following table compares two independent migration reports:

data plane Configuration	Cutover VM Downtime (Via a TCP Probe)	Total Migration Time
BGP + IP-in-IP	0s (None observed)	1m 13.7s
VXLAN + BGP	1s	1m 44.5s

Clean up

Run the following command to delete the demo environment:

kind delete clusters calico-lab

Conclusion

Three resources do all the heavy lifting: the kubeVirtVMAddressPersistence setting on Calico’s IPAM config, the allow-pod-bridge-network-live-migration annotation on the VM, and bridge-mode networking so the VM uses the pod IP directly. Get those right and a stateful VM moves between machines with its TCP connections open and its identity intact. The observer just makes the proof visible.

Try Calico VM migration in your browser

The post Save the Address, Save the Cloud: A Hands-on KubeVirt Live Migration Workshop appeared first on Tigera – Creator of Calico.

Save the Address, Save the Cloud (KubeVirt VM Migration Story)

Alister Baroi — Wed, 08 Jul 2026 20:45:58 +0000

Kubernetes is built for containers, and it’s been doing that since it used to run docker as an engine for its containers. But what if you want to add VMs to the mix? After all, containers are ephemeral and don’t require fixed IPs as they shift the identity toward labels, but VMs on the other hand are tied to IP addresses and in some cases MAC addresses.

This brings us to this blog about VM migration and IP preservation. Unlike a pod that can be part of a deployment and run in a swarm of stateless endpoints, a VM is a stateful machine run by hypervisor like QEMU and extended to Kubernetes via KubeVirt Custom Resource Definitions (CRDs).

There Is Something About KubeVirt

KubeVirt is an abstraction layer between the underlying hypervisor (QEMU) on your machine and Kubernetes. Its job is to manage a VM’s lifecycle and provide the necessary requirements for a VM to be a native resident in Kubernetes. These requirements are CPU, Memory, Networking, etc.

KubeVirt does this by wrapping each VM in an ordinary Kubernetes pod called virt-launcher. Inside that pod, KubeVirt runs libvirt and QEMU, and the “VM” is really just a process scheduled, networked, and accounted for like any other pod. That detail matters a lot once we get to migration: when a VM moves to another node, what Kubernetes actually does is create a brand-new virt-launcher pod on the destination and tear down the old one. Everything hard about live migration comes from making that pod swap invisible to the workload running inside.

CPU

CPU is the part that does the actual work, every instruction the guest operating system and its applications execute runs on a virtual CPU that KubeVirt maps onto real cores of the host node. You can pin the VM to dedicated cores, expose host CPU features, or let it float over shared cores. For migration, the CPU matters for a subtle reason: while a VM is being moved, its CPU keeps running and keeps changing memory. The faster the guest dirties memory, the harder it is to copy that memory to the other node before it changes again. We’ll come back to this race in a moment.

Memory

Other than being expensive these days, RAM or Memory has a crucial role in VM migration, since it is the place where everything that the CPU is working on is stored and referenced. In a physical computer, memory is the expensive stick that you buy and install in your computer. However, in a VM, memory is a region of your computer’s RAM allocated for the tasks that the VM is actively working on.

Memory is the thing migration is really about. When KubeVirt live-migrates a VM, the bulk of what it ships from the source node to the destination node is the VM’s RAM, gigabytes of it, while the guest keeps running and keeps writing to it.

Networking

Another important part of a VM is networking, and KubeVirt supports multiple networking modes. Our focus is going to be on bridge, since that is required for VM migration with Calico, but if you’d like to learn more about other modes feel free to check out the official KubeVirt documentation.

What is a bridge?

A bridge is similar to a playground where all the resources connected to it are able to communicate with each other. In Linux, a software bridge is a virtual switch: you plug interfaces into it and it forwards Ethernet frames between them just like a physical switch would.

In KubeVirt’s bridge mode, the VM is connected to the pod network through a Linux bridge, and the pod’s IP address is handed down to the VM itself (via DHCP). The guest doesn’t get some separate, NAT’d address, it uses the pod IP directly as its own. If the relevant pod interface has a MAC address and the VM doesn’t override it, the VM inherits that MAC too.

That “VM uses the pod IP directly” property is exactly why bridge mode is the only mode that works for live migration with Calico, and it’s the hinge the rest of this post turns on.

How local live migration actually works

Local live migration is the process of moving a running VM from one node to another within a cluster while the guest keeps running and stays reachable. No reboot, no shutdown, ideally the application inside never even notices.

You start the process by posting a VirtualMachineInstanceMigration (VMIM) object, or just running virtctl migrate vm1, and KubeVirt does the rest. The default strategy is pre-copy, and it works roughly like this:

A new target VM (a fresh virt-launcher pod) is created on the destination node, while the source machine is still running.
The source starts streaming chunks of VM state, mostly RAM, to the target. This repeats: pages that the guest dirties while the copy is in flight get re-sent.
Once enough state has transferred that only a tiny delta remains, the guest is briefly paused, the last pages are shipped, and the guest resumes on the target.
The source VM is removed.

Keep in mind that the VM migration is a race between the source and your network speed. If the source is copying memory to the target and the guest is simultaneously adding new things to that memory and this rate is faster than the network can copy them, the migration may never converge. KubeVirt has knobs for that (auto-converge throttles the guest CPU; post-copy runs the guest on the target immediately and faults memory across on demand), but for most workloads pre-copy just works, and that’s what we’ll see in the real report below.

KubeVirt is deliberately conservative here. Out of the box it runs at most 5 migrations in parallel cluster-wide, no more than 2 outbound per node, and caps each migration at 64 MiB/s of bandwidth, so a busy cluster doesn’t saturate its own network moving VMs around.

The hard part: the VM has to keep its IP

Here’s the thing the KubeVirt docs gloss over and where Calico Unified Platform does the heavy lifting. Remember that a migration is really a pod swap, old virt-launcher pod on node A dies, and a new one on node B is born. Normally, a brand-new pod means a brand-new IP. For a stateless web pod behind a Service, who cares. For a VM, the IP is its identity, every open TCP connection, every client that resolves it, every firewall rule references that address. Change the IP and you’ve effectively killed and rebooted the machine, which is the exact opposite of “live.”

So the job is: the new pod on node B must come up with the same IP the old pod had on node A, traffic must switch over to node B at exactly the right moment, and network policy has to already be in place on node B before that switch, otherwise the VM lands on the new node and gets firewalled off from its own connections. Calico coordinates all of this. Here’s how the pieces fit.

Bridge mode is non-negotiable

Because Calico’s IP persistence depends on the VM IP matching the pod IP, only bridge mode qualifies. Modes like masquerade give the VM a different internal IP than the pod and NAT between them, which breaks both IP persistence and policy enforcement during migration. KubeVirt actually refuses to migrate a bridge-mode VM unless you opt in with an annotation:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm
spec:
  template:
    metadata:
      annotations:
        kubevirt.io/allow-pod-bridge-network-live-migration: ""
...

VM IP address persistence

Calico keys its IP allocation off the VM’s identity rather than the pod’s. Internally, instead of the usual per-pod IPAM handle, the address is held under a VM-scoped handle like k8s-pod-network.vmi.default.vm1. When the target pod is created during migration, Calico’s CNI plugin recognizes it as a KubeVirt virt-launcher pod, looks up that VM handle, finds the existing IP, and reuses it instead of allocating a fresh one.

This is a cluster-wide setting added in Calico v3.32 (kubeVirtVMAddressPersistence, enabled by default), and it’s mandatory for this type of migration. With persistence off, Calico rejects the migration target outright.

Don’t NAT the VM on the way out

If a VM talks to something outside the cluster and natOutgoing is enabled on its IP pool, the server on the other end sees the VM’s traffic as coming from the node’s IP. Migrate the VM and that source IP changes from node A’s address to node B’s, and any in-flight connection to that external server breaks. So for migratable VMs you disable natOutgoing on their pool, keeping the VM’s own IP on the wire end-to-end.

Switching the traffic with BGP route priority

Both nodes briefly believe they host the VM’s /32 route. Calico breaks the tie with route priority: the target host programs the route with an elevated priority (a lower kernel metric, 512 by default) than the source’s normal priority (1024). Lower metric wins in the Linux kernel, so traffic steers to the target. Within a single AS this propagates automatically (Calico maps the kernel metric onto BGP local_pref); across eBGP rack boundaries you carry the signal with BGP communities via a BGPFilter. After a convergence window (~30s by default), priorities return to normal.

Policy must land before the switch

This is the subtle one. If the VM activated on the target node before its network policy was programmed there, it would arrive into a node that doesn’t yet know its firewall rules, and get cut off. Calico prevents this with an interlock: when the CNI plugin sets up the migration target pod, it returns the IP with empty routes, deliberately not programming the host-side routes that would pull traffic over. Felix only completes that switch once policy is in place on the destination (governed by policy_setup_timeout_seconds). Policy first, traffic second.

Wrapping up

Live migration looks like magic, a running machine teleports between hosts and nobody notices, but it’s really two systems cooperating very carefully. KubeVirt handles the compute side: racing the guest’s memory across the wire with pre-copy until it converges, then cutting the CPU over in under a second. Calico handles the network side: pinning the VM’s IP to the VM’s identity so the new pod reclaims it, withholding routes until policy is programmed on the destination, switching traffic with BGP route priority, and cleaning up dual ownership after a convergence window, all without the VM’s address ever changing.

Get the networking mode wrong (anything but bridge), forget to disable natOutgoing, or skip the policy interlock, and “live” migration becomes a reboot with extra steps. Get them right, as the report above shows, and a stateful VM moves between physical machines while its TCP connections stay open and its identity stays put.

Watch an interactive demo: Live Migration of VMs Running on Kubernetes

The post Save the Address, Save the Cloud (KubeVirt VM Migration Story) appeared first on Tigera – Creator of Calico.

Six AI agent SDKs for enterprise Kubernetes, compared

Alister Baroi — Fri, 03 Jul 2026 20:44:49 +0000

There’s a question we hear constantly from platform and engineering leaders right now, “which agent SDK should we standardize on for our Kubernetes clusters?”

The honest answer is that the question is slightly wrong, and the rest of this post explains why. But it’s a fair question, so let’s compare the contenders first.

If you’re an enterprise running on-premise or in your own VPC, the SDK you pick has to do two things most of the _ “build an agent in 20 lines” _ tutorials skip over. It has to run in a container you control, and it has to talk to a model you can host yourself. That second one rules out a surprising amount.

The six SDKs most people are actually using

These are the ones with the most mindshare in mid-2026. There are others, but these are the names that come up in every conversation. They sit on a rough spectrum of model freedom: most will happily run against a model you host yourself, the OpenAI SDK will too but treats that as a side path, and one of them (Anthropic’s) is tied to a single vendor’s models. I’ve ordered them with the most flexible first.

LangGraph

LangChain’s lower-level library. You model your agent as a directed graph: nodes do work, edges decide what happens next, and the whole thing checkpoints its state so a long-running agent can pause, resume, and even rewind. If your problem is _ “this workflow is genuinely complicated and has to survive restarts,” _ LangGraph is the one built for that.

For on-prem it’s reasonable. There’s a Helm chart for self-hosting, the core is MIT-licensed, and it’s model-agnostic so you can point it at a local model. The catch is operational weight: a production self-hosted deployment wants Postgres for state and Redis for streaming, so you’re running real infrastructure, not just a pod. The platform layer on top is commercial.

CrewAI

The one your team will get running fastest. You describe a “crew” of agents with roles (“researcher”, “writer”) and let them collaborate. The learning curve is the lowest of the six, the core is open source and MIT-licensed, built from scratch without a LangChain dependency, and it’s genuinely model-agnostic. People wire it up to Ollama or a self-hosted vLLM endpoint without much fuss. There’s a Helm chart for the enterprise platform, and if you want it to feel native to Kubernetes you can wrap crews as custom resources with something like Kagent.

The tradeoff for that simplicity is control. When you want fine-grained say over exactly what happens at each step, the role-based abstraction can feel like it’s deciding things for you.

Google ADK

Google’s ADK. The model here is a hierarchy: a root agent delegates to sub-agents, and it speaks the A2A (agent-to-agent) protocol natively, so agents built in different frameworks can talk to each other. It’s Apache 2.0 and ships in Python, TypeScript, Go, Java, and Kotlin, with the Python implementation the oldest and most complete. Its own docs say it “can work with almost any generative AI model,” with documented support for Claude, Ollama, vLLM, and others through LiteLLM, so it’s genuinely model-agnostic despite the Gemini-first defaults.

It looks Google-Cloud-coupled, and it does have a one-command adk deploy gke path, but that’s a convenience, not a requirement. Underneath it’s a container. You can run it on any on-prem cluster with hand-written manifests, and Google has published a real reference for running ADK against a self-hosted Llama model on vLLM. It’s Gemini-first by default, but you can bring other models through LiteLLM. Less locked-in than the branding suggests.

Microsoft Agent Framework

The grown-up merger of two earlier projects: AutoGen, which is where a lot of the multi-agent research came from, and Semantic Kernel, which is where the enterprise plumbing lived. It runs on Python and .NET, which is the real reason it’s on this list. If you’re a Microsoft and .NET shop, this is the one that speaks your language, literally.

It does two kinds of orchestration: the loose, LLM-driven kind where agents reason their way through a problem, and the deterministic, business-logic kind where you want a workflow to run the same way every time. For on-prem it’s a good citizen. It’s MIT-licensed and open source, it’s genuinely model-agnostic with first-party connectors that include Ollama, and people are already running it on AKS or plain Kubernetes against local open-weight models like Qwen or Mistral. One thing to keep straight: Microsoft’s hosted agent service is an Azure product, but the framework itself is yours to run wherever.

OpenAI Agents SDK

The cleanest developer experience of the six. It ships in Python and TypeScript, agents hand off control to each other explicitly, the API is small, and if your team already uses the OpenAI API they’ll be productive in an afternoon. For self-hosting, you bring your own container and infrastructure, which is fine.

It’s also more model-flexible than the name suggests, and this is the part worth knowing because it’s easy to miss. The API guide on the OpenAI platform site barely mentions it, but the Agents SDK’s own documentation has a “Models” page that points you to non-OpenAI providers two ways. The clean one is any OpenAI-compatible endpoint: you set a base URL and an API key, which covers local models served through vLLM or Ollama. Beyond that, official LiteLLM and Any-LLM extensions reach 100-plus providers, though the docs flag those as best-effort and beta. So you can run it fully self-hosted against your own model. OpenAI is still the default and best-supported path, but the lock-in is softer than the name implies. The next entry is where the real model lock-in lives.

Anthropic Claude Agent SDK

Anthropic’s harness, and the same engine that powers Claude Code, exposed as a library in Python and TypeScript. It spawns and supervises a CLI subprocess that owns a shell and a working directory, which is a genuinely different model from the others. Every agent is a long-lived process with state on disk, so you think about it more like running a fleet of little workers than calling a stateless API. The SDK code is MIT-licensed, though Anthropic’s docs note that use of it is governed by their Commercial Terms of Service, and Anthropic ships Dockerfiles and Kubernetes manifests for self-hosting it.

The honest caveat is the model. This runs on Claude, full stop, and it’s the only one of the six with no supported way to swap in your own model. You can route through Amazon Bedrock, Google’s Gemini Enterprise Agent Platform (formerly Vertex AI), or Azure to keep traffic inside a cloud account you control, which helps with compliance, but those are all just channels for hosting Claude, not alternative model vendors. There’s no air-gapped, weights-on-your-own-GPU story the way there is with the open-weight crowd. If your on-prem requirement is about latency, data residency, or “our cloud, our keys,” it can work. If it’s about never sending a token off the box, it can’t.

The comparison at a glance

SDK	Languages	License	Strengths	Weaknesses	Ideal use case
LangGraph	Python, JS/TS	MIT	Durable checkpointed state, pause/resume/rewind, model-agnostic	Operationally heavy (Postgres + Redis), commercial platform tier, steeper mental model	Complex, long-running workflows that must survive restarts
CrewAI	Python	MIT	Fastest to ship, lowest learning curve, model-agnostic, K8s-native via KAgent	Less fine-grained step control, the role abstraction can over-decide	Rapid multi-agent collaboration, prototype to production
Google ADK	Python, TS, Go, Java, Kotlin	Apache 2.0	Native A2A, hierarchical delegation, broad language support, works with almost any model via LiteLLM	Gemini-first defaults, branding implies GCP lock-in	Multi-framework systems betting on agent-to-agent interop
Microsoft Agent Framework	Python, .NET (C#)	MIT	Creative plus deterministic orchestration, first-party connectors including Ollama, reached 1.0	Youngest of the group, hosted agent service is Azure-only	C#/.NET teams needing both orchestration styles
OpenAI Agents SDK	Python, JS/TS	MIT	Cleanest developer experience, small API, runs local models via OpenAI-compatible endpoints (LiteLLM/Any-LLM as beta options)	OpenAI is the default and best-supported path; broad provider routing is beta	Teams who want speed, lean on OpenAI, but want an escape hatch
Anthropic Claude Agent SDK	Python, TS	MIT code; use under Anthropic Commercial ToS	Claude Code engine as a library, ships Docker and K8s manifests	Claude-only (no non-Anthropic or local models), stateful subprocess hosting model	Claude-centric teams comfortable routing via Bedrock or Gemini Enterprise

Picking one (or, realistically, several)

LangGraph if the workflow is hard and stateful. CrewAI if you want multi-agent collaboration running by Friday. ADK if you’re betting on A2A and a mix of frameworks talking to each other. Microsoft Agent Framework if your stack is already C#/.NET, or you want both creative and deterministic orchestration in one place. OpenAI’s SDK for the cleanest developer experience, with non-OpenAI and local models available through OpenAI-compatible endpoints (or its beta LiteLLM extension) if you need them. Claude’s Agent SDK if you want the Claude Code engine as a library and Bedrock or Gemini Enterprise is close enough to “on-prem” for you.

Five of the six can run against a model you host yourself. Four treat that as a first-class path, the OpenAI SDK does it through OpenAI-compatible endpoints (with LiteLLM as a beta add-on), and only Anthropic’s Claude Agent SDK is locked to a single vendor’s models, though Bedrock or Gemini Enterprise at least keep that traffic in your own cloud. For an on-premise enterprise that model-freedom question is the biggest filter. After that, the choice is mostly about how your team thinks: graphs, crews, hierarchies, or handoffs.

The honest part, though, is that most enterprises don’t pick one. The data team gets something working in CrewAI in a day. A platform engineer builds the stateful pipeline in LangGraph because nothing else handles the checkpointing. The .NET team reaches for Microsoft’s framework. Someone ships a Claude or OpenAI SDK agent before anyone writes a standard down. A year later you’re running several of these at once, plus whatever lands next quarter. That’s not a failure of planning. It’s just what a healthy, fast-moving org looks like, and it’s worth designing for rather than fighting.

Governing the fleet you’ll actually have

Here’s the catch that sits underneath all six options. Once an agent is a running pod, the SDK that built it no longer matters. From the cluster’s point of view, every agent looks the same: a workload making network calls to a model and to tools, acting on behalf of someone, doing things you didn’t watch happen. The SDK’s view stops at the edge of its own process. Your security and platform teams’ problem doesn’t.

None of the six frameworks govern that. It isn’t their job. They help a developer build an agent; they don’t tell you which agents exist in your cluster, what they’re allowed to reach, or what they actually did with the credentials you handed them. And because you’ll be running more than one framework, anything that only governs agents written a particular way leaves most of your fleet uncovered.

This is the gap Tigera Lynx is built to close. It governs agents at the platform layer instead of inside any single SDK, so the same controls apply whether the agent was written in LangGraph, CrewAI, ADK, or something that doesn’t exist yet. Lynx discovers the agents already running, including the ones nobody registered, using eBPF down at the kernel where the network call happens. An agent that skips your gateway entirely still shows up, because a syscall is a syscall regardless of the framework above it.

From there it puts a single control point in the path of every agent interaction and requires no changes to the agent’s code to do it. If governance depends on every developer importing your library and using it correctly, you don’t have governance, you have a polite request. Lynx works at the level where that assumption can’t break: discovery, policy, and a full audit trail your security team gets handed instead of reconstructing after the first incident. It’s already running in production at large banks, which are not known for a relaxed view of risk.

Pick the SDK that fits how your team builds. The decision that actually carries risk is whether anything sits between your agents and the rest of your cluster once they’re live, and that layer has to be SDK-agnostic, because your fleet already is. If your teams are shipping agents faster than you can govern them, see how Lynx governs AI agents on Kubernetes.

See Lynx discover and govern agents in a 3-minute interactive demo →

The post Six AI agent SDKs for enterprise Kubernetes, compared appeared first on Tigera – Creator of Calico.

Introducing Tigera Lynx

Alister Baroi — Fri, 19 Jun 2026 17:45:47 +0000

Today we're announcing the general availability of Tigera Lynx, a unified control plane for Kubernetes-native AI agents.

Lynx gives enterprises a single place to find every agent in their Kubernetes estate, tighten posture, assign a sandbox, give each agent a cryptographic identity, enforce policy on every action it takes, audit what agents actually do, and detect anomalous behavior — without changing a line of agent code.

It sits in the path of every agent call (agent-to-agent, agent-to-tool, and agent-to-LLM) to authenticate, authorize, mediate, and audit each one. It plugs into the tools you already run, including your identity provider (Entra ID, Okta) or SPIFFE/SPIRE and your existing observability systems, and is built on open standards rather than proprietary lock-in.

Built on a decade of deep Kubernetes network security experience, Lynx is generally available today 👉 https://www.tigera.io/tigera-products/lynx/

How Lynx Works: A Technical Walkthrough

Alister Baroi — Thu, 18 Jun 2026 17:19:30 +0000

We launched Lynx this week. Instead of restating the pitch, I want to explain how it’s built and why we made the architectural choices we did. If you run Kubernetes and you’re starting to put AI agents on it, this is roughly the system you’d end up designing yourself.

Lynx is a control and data plane for all agentic AI traffic, providing a registry, gateway, audit, authentication with token exchange, policy enforcement, agent sandboxing, shadow agent discovery, and advanced AI capabilities such as red team agent and a guardian supervising agent to keep your agents on track. Lynx is single control point in the path of every agent call – agent-to-agent, agent-to-MCP, agent-to-LLM. Every call is authenticated, authorized against policy, and recorded, with no changes to agent code.

The constraints we started from

Four principles shaped the design:

No agent code changes. Governance has to be applied by the platform, not adopted as a library. If it requires a code change, it won’t land uniformly – and uniformity is the entire point.
No new database in the control plane. The source of truth is the Kubernetes API server and the data model is custom resources – there’s no separate datastore to run, back up, and secure. (Telemetry is the one thing that needs a column store at scale; that’s kept separate and is bring-your-own.)
Don’t reinvent the data plane. Proxying agentic protocols – MCP, A2A, streaming LLM traffic – well is a full-time job. We wanted to own the policy, not the proxy.
Catch what doesn’t opt in. A governance layer that only sees traffic routed through it is blind exactly where the risk is. We needed an out-of-band way to find the agents nobody registered.

The data model

Lynx is Kubernetes-native to the core: its entire vocabulary is a small set of custom resources – Agent, MCPServer, LLMProvider, ServiceIdentity, and Policy – stored in the Kubernetes API itself. There’s no Lynx database; every record is something you can manage, GitOps, and RBAC like anything else in your cluster. The registry is a thin API in front of these resources. It records agents; it doesn’t run them.

Two ideas matter throughout. First, workload identity is the join key that ties a running pod to its registry record. Second, an agent becomes governed by two independent acts – it runs with an identity, and someone registers it – which means registration can happen in CI/CD at deploy time while the workload itself stays unaware of Lynx.

Identity: reuse what you already trust

A workload proves who it is with one of two mechanisms – it’s one or the other, and which one is recorded in the workload’s registration:

SPIFFE/SPIRE for mTLS workload identity, where private keys never leave the pod.
OIDC for tokens from your existing IdP (Entra ID, Okta, Keycloak). The binding is the issuer/subject pair you record at registration time. Because the in-cluster Kubernetes API server is itself an OIDC issuer, this path also covers plain Kubernetes ServiceAccount tokens – a pod’s projected token simply is its identity, with nothing to mount beyond what Kubernetes already gives every pod.

You can register more than one identity on an agent – which turns an IdP migration into a config change rather than a cutover – but any single call authenticates by exactly one. Human access to the dashboard uses the same IdP over OIDC, kept distinct so a person’s token is never mistaken for an agent’s. And not every caller is an agent: a ServiceIdentity lets a plain service or human-operated client be governed by the same machinery.

The validation pipeline behind all of this is deliberately strict and shared by every component: issuers are matched against a per-service allow-list (there is no “any issuer” mode), tokens are signature-verified and bound to an audience, and keys rotate automatically. The result is that agents reuse identity you already trust, rather than living in a parallel, ungoverned credential system.

One gateway for A2A, MCP, and LLM

The framing that matters most: Lynx is a single consolidated gateway for all three classes of agentic traffic. Today these tend to be governed by three different things – a service mesh for agent-to-agent, a bespoke proxy or SDK wrapper for MCP, an egress gateway or nothing at all for LLM calls. That fragmentation is how you end up with three identity models, three policy languages, and three audit trails nobody can correlate.

Lynx collapses them into one control point with one identity model, one policy language, and one audit trail. Agents, MCP servers, and LLM providers are all first-class objects with their own governed routes, and every call – whatever its kind – is authenticated, authorized, and recorded the same way.

The LLM path has a property teams feel immediately: the gateway holds the provider credential, the agent never does. Upstream API keys live in one governed place, rotate centrally, and never sit in agent pods – and when a provider needs no upstream auth, the gateway strips the caller’s credential so a Lynx-issued token can’t leak to a third party.

The data plane: drive the proxy, don’t fork it

The proxy in the request path is agentgateway, the open-source Rust proxy purpose-built for LLM/MCP/A2A traffic. We run it unmodified and drive it the way Envoy is driven – over xDS. Our control plane watches the custom resources and compiles them into the proxy’s native configuration; the proxy itself never sees a Lynx resource, has no Kubernetes access, and holds no cluster privileges.

That decoupling is deliberate, and it buys four things:

Blast radius – a malformed registration drops one route; it can’t corrupt the proxy.
Least privilege – the component on the wire has no API-server reach.
Schema freedom – we evolve our data model without touching the proxy.
Hot reconfiguration – register an agent and its route is programmed live, no restart.

The customer who already likes agentgateway gets it as-is, with Lynx’s governance layered on through the same open extension points they already trust – no proprietary lock-in at the data path.

The decision point: policy in the path, credentials minted per hop

Before the proxy forwards a request, it calls back into Lynx’s decision point, which runs the same sequence every time: authenticate the caller, validate the destination’s requirements, and evaluate policy in Cedar – a formally-grounded language, default-deny, with LLM, MCP, and agent access all expressed in one model. Only on an allow does the request proceed.

The property I care most about is what happens on allow: t he gateway mints a fresh, short-lived credential scoped to that one hop. When Agent A calls Agent B, A never holds a credential for B – it proves only its own identity, and the gateway issues a token good for exactly that destination, for a couple of minutes. A leaked token is useless beyond a single hop: no shared secrets, no standing keys, no blast radius.

For multi-step chains – A calls B, which calls a tool – this extends into proper on-behalf-of delegation built on RFC 8693 token exchange. An agent presents the token it already has and asks for a destination; Lynx validates it, checks policy at the moment of issuance (so an unauthorized hop never even produces a credential), and mints a destination-scoped token carrying the original subject. The payoff is threefold: agents stay IdP-agnostic (one endpoint, one credential), delegation is genuine and auditable end-to-end rather than just the last leg, and least privilege is enforced by construction. To the rest of your estate, Lynx looks like an ordinary OAuth2 provider – standards in, standards out.

Catching what routes around the gateway

A gateway only governs traffic that flows through it – and the agents that don’t route through it are exactly the ones worth finding. So Lynx watches at a layer the workload can’t bypass or tamper with: the kernel, via eBPF, deployed as a per-node agent that needs no instrumentation of the workloads it observes.

The first signal is LLM egress. Any workload calling a provider does a TLS handshake; Lynx observes that in the kernel, attributes it to a pod, and checks whether that pod is a registered agent – classifying each as registered, a shadow agent , or unattributable. This is the backstop for the LLM path specifically: even an agent that calls a provider directly, bypassing the gateway entirely, still does a handshake the kernel sees. The gateway governs what routes through it; this finds what goes around it.

Agent sandboxing

The same vantage point is also an enforcement point. Lynx can run each agent inside a tailored kernel-level sandbox – a per-workload syscall policy that constrains which operations the pod may perform – rather than letting it act with the full ambient authority of its pod. Notably, those policies are written in the same Cedar language as request authorization and compiled down to run in the kernel, so one policy model drives both the request path and the sandbox. Because enforcement lives in the kernel, a flagged or shadow agent is contained immediately and “unbypassably”, rather than merely alerted on.

This is also where the platform is heading next: a per-agent behavioral baseline over kernel-level activity, with anomaly detection for the cases a request-time policy can’t catch – credential theft, lateral movement, an agent doing something it never has – and agent-specific threats such as memory and context poisoning. Policy governs intent; this layer is about what actually happens.

Tracing, audit, and compliance

Everything emits OpenTelemetry, and the design choice that pays off here is that the gateway’s authorization decisions and the agents’ own reasoning – their LLM and tool calls – land in the same distributed trace. You don’t get one system for “what the agent did” and a separate one for “what the platform allowed”; you get a single, correlated timeline of each interaction.

That timeline is what turns governance into an audit story. Every call carries who the caller was, on whose behalf it acted, which policy permitted it, and what the decision was – and because each hop is independently authorized and freshly credentialed, the chain is attributable end-to-end , not just at the edge. Alongside the request traces, every change to the system itself – a registration, a policy edit – is recorded as an audit event capturing the actor, the operation, and the exact before-and-after. Together these are the reproducible, cryptographically attributable record that incident response and auditors ask for, and that frameworks such as SOC 2, HIPAA, GDPR, and financial-services regimes require you to produce on demand – without a separate logging project bolted on after the fact.

Traces and audit records flow into ClickHouse (bring-your-own), which powers the dashboard’s inventory, policy editor, audit search, agent-to-agent traffic graph, and shadow-agent views.

Driving Lynx: dashboard, CLI, and MCP

Everything in Lynx is an API over Kubernetes resources, so there are three ways to operate it – all thin clients over the same control plane:

The dashboard. A web UI for the people who live in this day to day – agent and MCP inventory, a Cedar policy editor, the agent-to-agent traffic graph, audit search, and trace exploration. It’s a Next.js and React app that renders agent execution traces with agent-prism, so a multi-hop, multi-agent interaction reads as a single timeline.
lyctl. A single Go CLI for everything scriptable – registering agents and MCP servers, authoring and testing policies, and standing up a complete demo environment in one command. It’s the natural fit for CI/CD, where registration belongs.
MCP. Lynx ships its own Model Context Protocol server that exposes the governance operations – list and register agents, write policies, inspect audit traces – as MCP tools. So you can drive Lynx straight from an AI assistant like Claude or Cursor : “register this agent,” “which agents can reach the payments MCP server?”, “what changed in policy yesterday?” The platform that governs agents is itself operable by one.

Built on open standards

We deliberately built Lynx on proven, open technology rather than inventing a parallel stack – it’s why it drops into an existing cluster and speaks the protocols your tooling already speaks:

Kubernetes-native – the entire data model is custom resources in the Kubernetes API; it installs as a single Helm chart and runs no database of its own.
Identity – SPIFFE/SPIRE for workload mTLS, and OIDC/OAuth2 with your existing IdP (including Kubernetes ServiceAccount tokens). Per-hop delegation uses RFC 8693 token exchange, and tokens are verified through standard JWKS.
Policy – authorization is expressed in Cedar, a formally-grounded, open policy language – the same language whether it’s evaluated in the request path or compiled into the kernel.
Data plane – the open-source agentgateway proxy, driven dynamically over xDS and integrated through the standard ext-authz contract, with native fluency in MCP and A2A.
Visibility and enforcement – eBPF for kernel-level discovery and sandboxing, with no instrumentation of the workloads themselves. Observability – OpenTelemetry end to end, stored in ClickHouse.

The throughline: Lynx contributes the governance layer – identity binding, Cedar policy, per-hop credentials, the agent registry – and bridges to everything else through open, standard contracts. No proprietary lock-in at the parts that matter most.

How it installs

Lynx is a single Helm chart on any conformant Kubernetes cluster. The minimal install is the registry and the gateway; the data plane, the policy decision point, the kernel-level detector, the telemetry pipeline, and the UI are each switched on as you need them. The most revealing first step is to turn on discovery and watch what’s already talking to LLM providers across your fleet – for most teams, that first scan surfaces agents nobody knew were running.

Explore our product page to see Lynx in action.

The post How Lynx Works: A Technical Walkthrough appeared first on Tigera – Creator of Calico.

Why We Built Lynx: Bringing Control to the Age of AI Agents

Alister Baroi — Wed, 17 Jun 2026 13:00:22 +0000

For a decade, one idea has guided everything we’ve built at Tigera: How do you secure a dynamic system with a lot of moving parts that is changing rapidly, with a programmatic approach? Calico has applied that idea for Global 2000 companies running the largest Kubernetes platforms in the world, securing tens of millions of mission-critical transactions every day. Today I’m excited to announce the next chapter of that work: Lynx, a unified control plane for Kubernetes-native AI agents.

This enables us to apply our deep knowledge of Kubernetes, eBPF, and our expertise in building scalable and highly performant systems to solve the security challenges that come with deploying AI Agents. Before I explain how Lynx addresses these challenges, it’s worth being clear about why AI agents are so hard to secure in the first place.

AI agents broke the assumptions security stacks were built on

The enterprise security tooling most organizations run was designed for workloads that are deterministic. A service does roughly the same thing today that it did yesterday. You can reason about its behavior, define what it’s allowed to touch, and trust that a valid credential maps to expected actions.

AI agents don’t work that way. They’re autonomous and non-deterministic. An agent acts on behalf of a user, reaches for whatever tool, LLM, or other agent it needs, carries a delegation chain, and reads untrusted input as it goes. The same agent can take a different path every time it runs. A valid credential no longer guarantees good behavior, it just guarantees the door opens. And every time a new agent or tool comes online or there are changes in the platform, the blast radius shifts again.

This leaves three teams staring at the same problem from three different angles, and none of them able to give a confident answer. The AI team wants to experiment with the latest technology and move fast. Platform engineering teams are measured on how fast they delpoy, but can’t prove the platform is actually under control. And the security team is asked to approve agents whose posture they have no real way to defend. Everyone needs to be accountable, but no one has the right controls.

Lynx closes that gap.

What Lynx does

Lynx sits in the path of every agent call, whether agent-to-agent, agent-to-tool, or agent-to-LLM, and authenticates, authorizes, mediates, and audits each one. It does this without changing a line of agent code, and it plugs into the tools you already run: your identity provider (EntraID, Okta) or SPIFFE/SPIRE, and your existing observability systems. It’s built on open standards, not proprietary lock-in.

One control plane brings together five capabilities that, until now, teams have been trying to stitch together by hand.

It starts with discovery. A central registry catalogs every agent, including its owner, its purpose, and its version, while eBPF-powered auto-discovery finds the agents nobody registered. Shadow agents are flagged and quarantined, and any agent’s actions can be reconstructed end-to-end through OpenTelemetry traces.

From there, Lynx manages posture. AI-CSPM continuously evaluates every agent against a baseline and surfaces drift and over-permissions the moment they appear, with per-agent sandboxing and pre-built compliance packs mapping to GDPR, HIPAA, SOC 2, and financial services requirements. A Red Team Agent continuously probes for weaknesses in posture and misconfigurations.

It gives every agent a real identity. Each one receives a verifiable cryptographic identity through integration with your identity provider (EntraID, Okta) or through SPIFFE/SPIRE, with no shared secrets. Long-lived API keys give way to short-lived, tightly scoped, auto-rotated tokens. A JWT token is minted for every hop of a multi-agent workflow so credentials are scoped to a single hop rather than handed around.

Lynx authorizes every transaction and enforces policy at the gateway. A single default-deny policy governs LLM, MCP, and agent access using the Cedar policy language, evaluated before any call executes. A misbehaving agent can be quarantined instantly, and a high-stakes call can be routed to a human—again, with no agent code changes. Lynx also provides the other controls that you need to secure and manage your agent: prompt injection, rate limiting, guardrails, budgets, spend limits, custom webhooks, MCP multiplexing, and aggregation and session management.

Lastly, Lynx watches for anomalous behavior at a layer agents can’t tamper with. eBPF and LSM observe every syscall, network call, and file access at the kernel, catching credential theft and lateral movement even when an action technically passes policy. This produces a forensic audit trail, and a Guardian Agent detects anomalous behavior and quarantines suspicious agents.

Security & visibility for Kubernetes-native AI

When I look at AI agents, I don’t see a new category that requires us to start over. I see the next class of workload that is autonomous, distributed, and increasingly embedded in critical business processes. I see AI agents actively interacting with traditional applications and databases running in containers/VMs, and the need for a unified solution that can secure this traffic. The discipline is the same one we’ve practiced from the start: give every workload a verifiable identity, evaluate every action against policy before it runs, and observe behavior closely enough to catch what policy alone can miss. And do this in a manner that is agnostic of the underlying infrastructure so that we can help you avoid platform lock-in and the risk of a price hike that goes with it.

As our CTO Peter Kelly puts it, because we watch behavior with eBPF and LSM at the kernel, we can detect an agent going wrong even when it carries a valid credential—and produce a reproducible audit trail to prove it. That’s the difference between hoping an agent behaves after acquiring a valid token, and knowing what it did.

AI agents are going to keep multiplying across your estate. The question isn’t whether you’ll run them. It’s whether you can see them, govern them, and answer for them. With Lynx, you can.

Lynx is generally available today. It scales horizontally on a Kubernetes-native architecture with no per-call overhead, and it’s already running in production at some of the world’s largest banks.

Explore our product page to see Lynx in action.

The post Why We Built Lynx: Bringing Control to the Age of AI Agents appeared first on Tigera – Creator of Calico.

Five Principles of an Accountable AI Agent Network: How to Evaluate Any Governance Platform

Alister Baroi — Wed, 10 Jun 2026 20:19:08 +0000

The first post in this series argued that AI agent governance hasn’t kept pace with deployment. The second laid out the five pillars of accountability, and what is required. The third walked through why network policies, API gateways, MCP/A2A protocols, DIY security patterns, and Role-based Access Control (RBAC) each leave critical accountability gaps.

So what does good look like?

The five pillars define what AI agent accountability requires. The principles below define how a governance platform should deliver it. These are the architectural principles your team should evaluate any AI agent governance solution against, whether you build it, buy it, or assemble it from open-source components.

If a vendor pitches you a governance platform that fails any of these five, walk away.

What are the five principles of an accountable AI agent network?

Kubernetes Network Policies are essential for securing any cluster. They restrict which pods can communicate with which other pods at the network level, and they should absolutely be part of your security posture.

Default-deny: No agent communicates unless a policy explicitly permits it.
Attribute-based policy: Policies reference agent attributes, not agent names.
Zero-trust identity: Every request authenticated, every identity verified.
Audit by design: Every interaction produces a structured, correlated trace automatically.
Kubernetes-native: The platform extends your existing infrastructure rather than replacing it.

Each principle below explains why it matters and what a passing solution looks like.

Use the five principles as a checklist when evaluating any governance platform. Fail any one, and the platform is one missing principle away from security theater.

Principle 1: Default-deny

No agent communicates with any other agent unless explicitly permitted by policy.

This is the only safe starting posture for accountability. If your governance layer defaults to allowing communication and only blocks what’s explicitly forbidden, every interaction you didn’t anticipate is ungoverned, and you can’t be accountable for what you didn’t authorize.

Default-deny flips the model: nothing is allowed until a policy explicitly permits it. Every allowed interaction is intentional, traceable, and auditable. New agents are isolated by default until policies are written to grant them access, which is exactly the behavior you want in a governed network.

Default-deny seems restrictive, but in practice it’s liberating. Your security team doesn’t have to anticipate every possible _ bad _ interaction. They only have to define the good ones.

Principle 2: Attribute-based policy

Policies should reference agent attributes, not agent names.

Hardcoding agent names in policies creates a governance system that breaks every time you add or rename an agent. It’s the equivalent of maintaining a firewall with hundreds of IP-based rules instead of using network segments.

Attribute-based policies reference properties like capabilities, risk levels, team ownership, and environment tags. Instead of “Agent-Finance-v2 can call Agent-Compliance-v3,” the policy says “Agents with capability=financial-analysis can call agents with capability=compliance-query.”

This approach has a powerful scaling property: when a new agent registers with matching attributes, existing policies apply automatically. The governance grows with the agent network, not against it. A team deploying a new agent doesn’t need to file a ticket to update allow-lists, they describe the agent’s attributes at registration time, and the policy engine handles the rest.

This is the principle that separates a security model that survives at 10 agents from one that survives at 1,000.

Principle 3: Zero-trust identity

Every request authenticated. Every identity verified. Trust nothing by default.

Agent networks are susceptible to the same identity threats as any distributed system: spoofing, replay attacks, credential theft. But agents add an unique challenge: they operate on behalf of the users. This means both the workload identity (is this actually the agent it claims to be?) and the user identity (on whose behalf is this agent acting?) must be verified.

A governance platform should support dual authentication : cryptographic workload identity (proving the agent is genuine) and token-based user identity (establishing who triggered the action). Both identities should be available for policy evaluation and audit logging.

Short-lived credentials, automatic rotation, and cryptographic verification should be standard, not optional add-ons. Static API keys and long-lived tokens are liabilities in an agent network; compromised credentials can enable automated lateral movement at machine speed.

Principle 4: Audit by design, not by afterthought

Every interaction produces a structured, correlated trace automatically.

If your team has to add logging after the fact, you’ve already lost accountability. Audit records should be a byproduct of the governance layer’s enforcement , not a separate system bolted on later.

When the governance layer evaluates a policy and permits (or denies) an agent interaction, that evaluation is the audit record. It captures: who called whom, what policy was evaluated, what the decision was, what attributes matched, and when it happened. These records should be:

Structured (not free-text logs),
Correlated across multi-hop chains (using distributed trace IDs),
Queryable by agent, by policy, by time range, by outcome.

The practical implication: the audit trail should be a first-class product of the governance platform, not a configuration option. If you have to enable it, someone will forget. If it’s built in, it’s always there.

Principle 5: Kubernetes-native

The governance layer should work with your existing infrastructure, not replace it.

Enterprises have invested heavily in Kubernetes, Helm charts, GitOps pipelines, RBAC, namespaces, and observability stacks. An AI agent governance platform that requires a separate control plane, its own deployment model, or a proprietary orchestration layer will face adoption resistance and operational overhead.

The governance platform should be deployable via Helm, manageable via CRDs, observable (e.g. via Prometheus or OpenTelemetry), and compatible with existing identity infrastructure (OIDC providers, SPIFFE/SPIRE). It should feel like a natural extension of the Kubernetes platform, not a foreign system that happens to run on it.

This isn’t just about developer experience. It’s about operational sustainability. If the governance platform requires specialized skills your platform team doesn’t have, it will become a bottleneck instead of an enabler.

How the principles reinforce each other

These five principles aren’t independent. They reinforce each other:

Principle	What it enables
Default-deny	Provenance; every allowed interaction was explicitly authorized
Attribute-based policy	Governance at scale; authorization grows with the network
Zero-trust identity	Trust in audit records; every participant is verified
Audit by design	Traceability and compliance; every decision is recorded
Kubernetes-native	Adoption; the platform integrates with existing infrastructure

When evaluating governance solutions, test each principle:

If a solution requires you to default to allowing communication and only block specific interactions, it fails Principle 1.
If it requires naming agents in policies, it fails Principle 2.
If it relies on static API keys or long-lived tokens, it fails Principle 3.
If it doesn’t produce correlated audit trails automatically, it fails Principle 4.
If it needs its own control plane outside Kubernetes, it fails Principle 5.

The right solution delivers all five. Because accountability requires nothing less.

Frequently asked questions

What’s the difference between default-deny and zero-trust?

Default-deny is a policy posture — no communication unless explicitly permitted. Zero-trust is an identity posture — every identity must be verified, every time. They reinforce each other but aren’t interchangeable. A platform with zero-trust identity but default-allow policy is still ungoverned.

Why does Kubernetes-native matter for AI agent accountability?

Because adoption is the difference between a governance platform that works and one that gets shelved. If your platform team has to learn a new control plane, run a parallel deployment pipeline, or operate a proprietary policy engine, the governance layer becomes a bottleneck — and ungoverned agents start showing up because the official path is too slow.

Can I build this myself with SPIFFE, OPA, and OpenTelemetry?

Technically yes. Practically, you’ll spend 6–12 months on the integration glue, audit correlation across multi-hop chains, dual identity verification, attribute-based policy modeling, and the human oversight surface. We covered the build-vs-buy tradeoff in post 3 of this series.

Are these principles specific to Tigera Lynx?

No. These are architectural principles for any accountable agent governance platform — whether commercial, open source, or homegrown. We use them ourselves to evaluate Lynx, and we’d encourage you to use them to evaluate every option you consider.

Key takeaways

Default-deny is the only safe starting posture. Anything else leaves ungoverned interactions.
Attribute-based policy is the principle that lets governance scale past 100 agents.
Zero-trust identity must verify both the workload (is this the right agent?) and the user (on whose behalf is it acting?).
Audit by design means audit records are a byproduct of enforcement, not a separate system.
Kubernetes-native ensures the platform actually gets adopted instead of bypassed.

Get the strategic guide for accountable AI agents

We wrote a strategic guide, Accountable AI Agents: A Strategic Guide for AI & Security Leaders Governing Autonomous AI at Scale, that walks through these principles in depth, including a side-by-side comparison of common governance approaches and how they score against each principle.

Get the strategic guide for accountable AI agents →

The post Five Principles of an Accountable AI Agent Network: How to Evaluate Any Governance Platform appeared first on Tigera – Creator of Calico.

Kubernetes Operational Maturity: Why You Should Modernize Your Ingress with Gateway API

Alister Baroi — Wed, 10 Jun 2026 19:12:12 +0000

SIG Network introduced Ingress in 2015 as a minimal way to expose HTTP services from a cluster. That simplicity was an advantage at a time when most workloads were HTTP, clusters were single-tenant, and the occasional gap could be papered over with a vendor annotation. As adoption grew and Kubernetes started running serious production workloads across multi-tenant, multi-cluster, multi-protocol environments, the annotations multiplied into incompatible dialects, and most organizations outgrew what Ingress could handle on its own.

The Ingress-NGINX Controller retirement, and the migration conversations that followed, exposed these cracks, but they were never the full story. Ultimately, ingress needed to grow up and the arrival of Gateway API, with SIG Network freezing the Ingress at v1 in favor of this successor, was what that looked like.

Even if migration has not been forced on your organization by the Ingress NGINX retirement, any team trying to reach Kubernetes operational maturity should be considering Gateway API as the next step on that journey.

Three reasons why Gateway API is more than and Ingress replacement

Gateway API is not just a new and improved Ingress with a few additional features bolted on. It re-architects incoming traffic management in three key ways that are essential to any organization quickly growing beyond one or two teams operating a couple of clusters: it now supports common protocols beyond HTTP, it provides standardized schemas for advanced traffic routing and it has decoupled infrastructure from application traffic routing allowing separation management concerns.

Gateway API should be on the roadmap if any of the following use cases apply to your organization:

1. You need expanded protocol support

Are you running a diverse collection of AI workloads in your clusters? Do you host streaming services? Do your workloads need external database access?

Protocols like gRPC, TLS, TCP, and UDP are now integrated as first-class resources rather than being treated as secondary extensions requiring complex annotations or vendor-specific workarounds. This isn’t a cosmetic change. When the Ingress API was designed, treating HTTP as the default was a reasonable assumption. In 2026, it isn’t. AI inference traffic is overwhelmingly gRPC. Real-time streaming, external database access, and edge workloads rely on TCP or UDP.

Management overhead increases and the attack surface expands with every cumbersome workaround that is required when an API fails to provide native support for these protocols. gRPC, TLS, TCP, and UDP should be treated as first class citizens, not as exceptions.

Routing support for multiple protocols

2. Your traffic routing needs to support complex scenarios such as weighted load balancing, cross-cluster failover, canary deployments and more

If your traffic management use cases go beyond the traditional host and path rules supplied by Ingress, Gateway API is the solution best suited for these complex scenarios. Rules for weighted load balancing, cross-cluster failover, and canary deployments are now built directly into the HTTPRoute specification, eliminating the need for annotations, external routing components or bespoke systems.

In addition to a reliance on annotations, traffic routing can live outside the layer. Maybe the platform team stands up a service mesh just for canary support or application teams write retry and failover logic into their services. Sometimes someone writes a homegrown traffic controller that runs on an old server under their desk. An organization with SLOs, revenue that depends on high availability or has compliance requirements should aim for a standardized rule schema that supports most, if not all, traffic routing use case.

Rules for advanced traffic management

3. You have multiple application teams trying to ship services each with its own routing requirements

When the Ingress API was designed, the cluster admin owning every routing rule was a reasonable design choice. As Kubernetes environments scale to thousands of nodes running countless services, this pattern becomes an obvious bottleneck. It is simply not possible, given the ratio of developers to infrastructure engineers in most organizations, for cluster operators to keep up with the deployment of new services and routing configurations.

In order to scale, an organization needs to empower teams to self-serve. It needs to separate the infrastructure layer from the routing layer and enforce security via RBAC. Gateway API provides this modularity with infrastructure teams managing GatewayClass for broad policies, platform teams oversee the Gateway as a shared network entry point, and application teams independently controlling specific routing rules with each team having access to only the resources they are responsible for.

Empower teams to manage what they own

The bottom line is that if your organization is planning to grow, you need your Kubernetes operations to mature and, consequently, you need to modernize the way you handle ingress. You need to adopt an architecture that treats multi-protocol routing as first-class, expresses traffic management as part of the spec, and gives each team in the chain the ownership it needs if you want the complexity that comes with scale to be manageable.

How mature is your ingress?

Most teams can place themselves in one of these four stages within a sentence or two.

Beginner. The Kubernetes Ingress API with annotation-driven customization. A single ingress controller managed by the cluster admin. No traffic splitting. Every new service or routing change goes through a ticket.

Intermediate. Migrating to Gateway API, often by swapping the ingress controller for a Gateway API implementation. Basic HTTPRoute rules deployed. Application teams still depend on the cluster admin or platform team for Gateway-level changes.

Advanced. Full Gateway API adoption with role separation enforced through RBAC. Weighted traffic splitting and automatic failover working across clusters. Multi-protocol routing is live in production.

Optimized. Gateway API integrated with CI/CD for progressive rollouts. Cross-cluster traffic management with automated canary analysis. Multi-protocol routing managed declaratively by application teams. The platform team is no longer in the path of routine deployments.

Migration Should Be More Than a Simple Replacement

The Ingress NGINX retirement put ingress on the table for many organizations, but the real opportunity isn’t tied to that deadline. It’s the chance to rebuild an ingress layer that was designed for a different era of Kubernetes around the workloads and team structures most clusters run today. Teams that treat the migration as a chance to modernize come out with multi-protocol routing, declarative traffic management, and role-based ownership as platform capabilities. Teams that treat it as a controller swap come out with a new logo on the same architecture. The difference is the difference between a migration and a modernization.

Ingress is one of nine pillars in the operational maturity reference architecture. The full nine-pillar reference architecture, including the egress, microsegmentation, observability, and service mesh pillars that build directly on cluster mesh, is in our ebook Building Resilient Multi-Cluster Kubernetes. If you would rather work through it hands-on, our reference architecture workshop walks the first five pillars, the next steps on your operational maturity journey, in a working environment.

Read our ebook, Building Resilient Multi-Cluster Kubernetes

The post Kubernetes Operational Maturity: Why You Should Modernize Your Ingress with Gateway API appeared first on Tigera – Creator of Calico.

Multi-Layer Policy for Securing AI Agents

Alister Baroi — Wed, 03 Jun 2026 18:58:09 +0000

As part of our work at Tigera building products that create secure runtime environments for enterprise agents at scale in the real world, one small part of this puzzle I think about a lot is policy, and runtime enforcement of policy, and how to create a comprehensive secure runtime, configured from one place. The more companies we talk to trying to lock down and secure these platforms at runtime, the more I believe AI Agent security needs policy in multiple places, not just one (e.g., not just at the gateway layer), and ideally expressed in the same policy language.

At the L7 gateway layer, every agent call is observable: who is calling, what they are calling, what attributes both sides carry, what the requested action is. This is where you decide whether an agent should be permitted to talk to a particular MCP server, invoke a particular tool, delegate to another agent, or call a particular LLM. The atoms of policy here are identity, action, resource, and context.

At the agent runtime layer, or kernel layer in a container, what the agent does inside its own runtime is observable: syscalls, file access, library loads, network connections that bypass the brokered channel. This is where you decide whether the agent can read a file, open a socket, spawn a subprocess, or load a library. The atoms of policy here are processes, paths, file descriptors, and system calls.

Both layers are necessary. The gateway alone cannot constrain what an agent does inside its runtime once it holds a token. The kernel alone cannot reason about identity, delegation, or multi-hop intent. Building policy at one and not the other leaves a category gap.

The architectural choice that makes this work in practice is using one policy language for both. We use Cedar at the gateway and interpret and translate Cedar to eBPF policy for the kernel: same policies, two enforcement points, one place to author and review.

Policy at the gateway: enforcing agent intent

The gateway sees intent. It is the right place to enforce who can talk to whom, under what conditions, with what level of human oversight.

A Cedar policy that constrains which agents can invoke which tools:

permit (
  principal in Group::"finance-agents",
  action == Action::"invokeTool",
  resource in ToolSet::"finance-readonly"
) when {
  principal.risk_level == "low" &&
  context.delegation_depth <= 3
};

This policy expresses several things that are hard to model in RBAC or in a network policy. The principal is identified by group membership but constrained by attribute (risk_level). The resource is a typed set of tools. The condition includes a check on delegation depth; agents three hops deep in a delegation chain are refused even if they pass every other check.

The gateway layer naturally enforces delegation rules, per-hop token issuance with scope reduction, agent-to-MCP tool authorization, agent-to-LLM constraints, human-in-the-loop hooks for high-stakes actions, and attribute-based decisions across all of these.

What the gateway cannot do is constrain what happens after it issues a token. Once the agent has the credential, the kernel is the only layer that sees what the process actually does with it.

Policy at the kernel: constraining agent behaviour

The kernel sees behaviour. It is the right place to enforce what an agent process is allowed to do at the operating system level, regardless of what tokens it holds.

A baseline sandbox for an agent workload, expressed conceptually in the same Cedar policy model and compiled to BPF programs at runtime:

permit (
  principal in AgentClass::"data-analyst",
  action in [Action::"readFile", Action::"writeFile"],
  resource is FilePath
) when {
  resource.path like "/workspace/analyst-*" ||
  resource.path == "/var/run/secrets/analyst-key"
};
forbid (
  principal in AgentClass::"data-analyst",
  action == Action::"connectNetwork",
  resource is NetworkDestination
) unless {
  resource.host in DestinationSet::"approved-llm-endpoints" ||
  resource.host == "lynx-gateway.internal"
};

The compilation target is BPF LSM hooks, cgroup network hooks, and file access enforcement at the inode boundary. When the agent process steps outside what the policy permits, the kernel refuses the operation – EPERM for the syscall, ECONNREFUSED for the network connection, ENOENT for the file access. The agent gets the same error it would get for any prohibited operation, regardless of what credentials it holds.

The kernel layer naturally enforces file access boundaries, network egress restrictions, syscall whitelisting, library load constraints, and process-spawn limits. The same observation pipeline that feeds enforcement also feeds threat detection.

What the kernel cannot do is reason about why an action is being attempted. It sees a connect() system call. It does not know whether the call is part of a legitimate multi-hop delegation that the gateway already authorized. That context only exists at the L7 layer.

The dual-layer architecture

The architectural integration matters as much as either layer in isolation. Cedar policies authored once, evaluated at the gateway, compiled to BPF for kernel enforcement. The compilation is not magical—only the substrate-relevant subset of Cedar policies compiles. The rest stay at the gateway. Either way, security teams write Cedar; the runtime decides which layer is the right one to enforce at.

This integration is what makes the dual-layer approach operationally sustainable. Without one policy language, you end up with two policy stores, two review processes, two engineering teams, and inevitable divergence between what the gateway permits and what the kernel allows. With Cedar at both layers, the policy you wrote is the policy that gets enforced everywhere.

Why single-layer policy isn’t enough for AI agent security

Policy at the gateway alone defends against unauthorized callers and out-of-scope actions. It does not defend against a compromised agent that has a legitimate token and uses it to do things outside its intended behaviour, like read credential files, exfiltrate data through side channels, and escalate privilege inside its runtime.

Policy at the kernel alone defends against process-level misbehaviour. It does not understand identity or delegation, cannot reason about whether a network connection is part of a legitimate multi-hop chain, and has no way to enforce human-in-the-loop approval flows.

Combined, the two layers cover the threat model that either layer alone misses. A compromised agent with a legitimate token can still call out through the gateway, but its local actions are constrained by the kernel sandbox. A misconfigured Cedar policy at the gateway is mitigated by the substrate baseline. A shadow agent that never registered is observed and contained at the kernel.

For Kubernetes-native enterprises building agent infrastructure into regulated workloads, this is the architecture worth building toward. Gateway policy for what agents are allowed to ask for. Kernel policy for what they are allowed to do. Same language for both.

Going deeper

Multi-layer policy is one piece of a larger problem: making AI agent infrastructure accountable end-to-end. Traceability, authorization provenance, identity and ownership, policy-based governance at scale, and human oversight and intervention—they all have to work together.

Read: The Five Pillars of AI Agent Accountability →

The post Multi-Layer Policy for Securing AI Agents appeared first on Tigera – Creator of Calico.

What’s new in Calico: Spring 2026 Release

Alister Baroi — Tue, 02 Jun 2026 16:10:41 +0000

Kubernetes has come a long way since its debut in 2014. It’s gone from running a couple of containerized microservices to orchestrating fleets of production workloads spanning everything from AI agents to full scale VMs running in pods. As Kubernetes adoption grows, and its use cases stretch to cover more ground, managing its increasingly complex networking and security landscape demands operational maturity and a platform that supports it.

The Spring 2026 release of Calico provides that support in two key areas:

Unified operations across Kubernetes pods and VMs

KubeVirt Live Migration in Bridge Mode allows you to migrate VM workloads with IPs preserved, minimal packet loss, and fast route convergence. VMs can move between nodes for planned maintenance, load balancing and to support high availability without interrupting network connectivity.
Egress Gateway Layer 2 Advertisements (Enterprise exclusive) lets pod traffic egress with IPs from the host’s own subnet so workloads get a stable identity the rest of your network already recognizes eliminating the need for BGP Peering to advertise Egress Gateway IPs.
Policy recommendations for VMs and hosts (Enterprise exclusive) automates and scales policy authoring for Calico-managed workloads running outside of your Kubernetes clusters.
OpenStack Live Migration Improvements lets you migrate VM workloads running in high availability OpenStack environments with minimal risk of service disruption during maintenance. Preloading policies on the target node keeps downtime inside the single-digit-second SLOs regulated workloads require.

Production-grade operations at scale

Whisker Policy Verdict and UI Improvements reveal connectivity blockers in minutes by letting you see the actual tier, policy, and rule that denied a flow.
Calico Load Balancer – Maintenance Mode (Enterprise exclusive) supports graceful node maintenance by excluding backends on nodes marked for maintenance from new Maglev assignments, allowing existing connections to drain naturally. Operators can monitor active connections via Prometheus metrics to determine when it is safe to proceed with node maintenance

What’s new in Calico Open Source v3.32

Two new noteworthy features headline this release: Kubevirt Live Migration and Whisker UI improvements.

KubeVirt Live Migration in Bridge Mode

Running VMs in Kubernetes comes with many challenges, among them the need to preserve a VMs IP during live migration so that network traffic can continue uninterrupted. One way to handle this is with Multus and a bridge CNI, statically configuring the VM’s IP and plumbing it directly into the underlay. That preserves the IP, but the VMs sit outside of Calico which means no microsegmentation, no observability and no shared tooling with pods running alongside these VMs. With Calico v3.32, Calico IPAM assigns persistent IPs to KubeVirt VMs. The IP survives live migration and pod restarts and can be advertised upstream over BGP. VMs share the same Kubernetes-native pod network as containers, with the same CNI, policies, observability, load balancing, QoS, and Layer 7 traffic management.

Live migration in bridge mode ships as a tech preview in Calico Open Source v3.32 and moves to production GA in the June release.

Key Benefits of KubeVirt Live Migration in Bridge Mode:

Migrate VMs With Live Connections: Ensure long-lived TCP sessions such as database queries stay connected across the migration so applications don’t have to reconnect.
Keep VM Workloads Reachable During Maintenance: Live migrate VMs to new nodes without blocking user access to applications.
Monitor VM Migrations in a Shared Dashboard: Track live-migration success rates, duration, and post-move network metrics in the same place you track pod activity.
Run One Network, Not Two: Stop maintaining parallel networking layers with VMs sharing the CNI, policy framework, and observability stack with your pod workloads.

Scenario: Live Migration That Keeps VMs on the Pod Network

The Situation:

A financial services enterprise is consolidating its virtualization estate onto KubeVirt on Kubernetes. The VM count sits in the six figures across dozens of clusters. Live migration is part of routine operations: VMs move between nodes during patching, capacity rebalancing, and host failures. The current workaround is Multus and a bridge CNI plumbed into the underlay, which keeps the IP through the move but leaves the VMs outside Calico’s pod network. The platform team would like to implement microsegmentation and observability for VMs as they do for containerized applications.

The Calico Solution:

Calico IPAM assigns each KubeVirt VM a persistent IP that survives live migration and pod restarts, advertised to the upstream network over BGP. Every VM runs on the same Kubernetes-native pod network as the containers next to it, with the same network policies, observability, load balancing, QoS, and Layer 7 traffic management. When nodes go down for maintenance, VMs move and connections survive. The microsegmentation and observability story stays intact.

Whisker Policy Verdict and UI Improvements

Knowing a flow was blocked by policy is a good start to troubleshooting a connection problem. It does not, however, answer the more important question: what policy is responsible and why? Without knowing the reason a flow is denied, the problem cannot be fixed and tracing a flow’s journey across multiple policy tiers and rules can be unreliable and time consuming, potentially prolonging an outage.

The Whisker updates in v3.32 put the verdict, the matching policy, and the full tier chain right in the flow log view. See all the policies that were invoked by drilling down into a flow. Filter by policy kind, tier, namespace and policy name to find out which flows selected policies take action on.

Key Benefits of Whisker Verdict Improvements:

See the Policy Kind, Tier, and Rule Behind Every Verdict: Surface the full evaluation chain, not just the allow/deny decision.
Filter by Verdict or Policy: Narrow the flow log view to just denies or filter by kind, tier, namespace and name, or any combination, to see which flows a set of policies affects.
Close Policy-Denial Tickets in Minutes: Reduce the troubleshooting path from a lengthy and painstaking analysis of policy layers to a thirty-second click into the matching rule.
Let Application Teams Self-Serve: Trace your team’s own policy denies without waiting on the platform team.

Scenario: The Five-Minute Incident That Used to Take an Hour

The Situation:

A developer on the web-app team opens a ticket: their new service can’t reach the payment service. An on-call platform engineer pulls up Whisker, sees the flow was denied, and starts the usual investigation, checking tiers, scanning policies and cross-referencing rules, while walking the developer through each step. Forty minutes later, they find the issue: the payment tier has a default-deny policy that doesn’t include web-app in its allowed-set.

The Calico Solution:

With the Whisker verdict view, the platform engineer opens the flow log, filters by denied flows for the web-app service, and clicks the first matching row. The verdict panel immediately shows the tier, policy, and rule that produced the deny with enough context to describe the fix. The incident is resolved in five minutes, and the ticket closes with a clear remediation path. The platform engineer then stages the fixed policy and then in Whisker filters by kind, tier and policy name to see if any other flows will be affected, averting potential problems.

Click a denied flow to see the tier, the policy, and the rule that produced the verdict.

ClusterNetworkPolicy: Cluster-Wide Policy Goes Standard

Calico has had GlobalNetworkPolicy for years, cluster-scoped policy that sits above namespace boundaries and gives platform teams a place to define org-wide guardrails, default-deny baselines, and cross-namespace controls. The Kubernetes SIG-Network ClusterNetworkPolicy spec is the upstream community’s version of the same idea, and Calico Open Source v3.32 implements it.

While this is more housekeeping than a headline feature, it has two important implications. First, for the Kubernetes community, Calico’s conformant implementation keeps the spec moving and helps cement cluster-wide policy as a first-class part of the standard. Second, for platform teams already running Calico, ClusterNetworkPolicy provides the same cluster-level control surface as GlobalNetworkPolicy, but utilizes the standard upstream API. This means that tooling built around the spec remains reusable and consistent, regardless of the underlying network implementation.

If you’ve been using GlobalNetworkPolicy in your policy pipelines, you don’t need to do anything; everything keeps working. If you’re starting fresh or building tooling that needs to work across multiple CNIs, ClusterNetworkPolicy is now an option to consider.

Key Benefits of ClusterNetworkPolicy:

Define Policy Cluster-Wide With the Standard API: Use the upstream SIG-Network ClusterNetworkPolicy spec at the cluster level, no vendor-specific CRD required.
Adopt the Standard Without Re-Learning: ClusterNetworkPolicy mirrors GlobalNetworkPolicy in shape and behavior, so platform teams already running Calico’s cluster-scoped policy keep the same mental model and tooling.
Stay Aligned With Where Kubernetes Is Heading: Calico’s early implementation moves the SIG-Network ClusterNetworkPolicy spec toward general adoption, cementing cluster-wide policy as a first-class Kubernetes concept.

Cluster-wide network policy scope, now in the standard upstream API

OpenStack Live Migration Improvements

Calico’s route management work in v3.32 closes the gap that’s kept regulated workloads out of OpenStack live migration. By preloading network policies on the target node ahead of a migration, traffic resumes the moment the VM lands instead of waiting for the network to catch up. This solution, which leverages the same route management code that powers KubeVirt Bridge-Mode live migration, addresses the pain of migration for specific industries that measure downtime in single-digit seconds.

Key Benefits of OpenStack Live Migration Improvements:

Migrate Within Your Downtime SLO: Complete OpenStack live migrations within the single-digit-second SLOs that regulated workloads require.
Live Migration During Active Hours: Run live migration without having to wait for off-hours maintenance windows.

Scenario: Migrating a Trading Workload During Market Hours

The Situation:

A regulated financial-data provider runs a trading workload on OpenStack with a single-digit-second downtime SLO for live migrations. Their current KVM live migration routinely stalls long enough to violate it. The platform team has been limited to performing host maintenance during narrow after-hours windows, and some migrations have simply been deferred indefinitely.

The Calico Solution:

After upgrading to Calico v3.32, the team measures live-migration downtime against their reference workload and finds it consistently within SLO. Host maintenance is now possible during trading hours. Deferred migrations can be scheduled and completed without requiring an after-hours rotation.

The node is ready when the VM arrives reducing downtime

Also in this release: Istio Ambient Mode comes to Calico Open Source

Not new, but new here. Calico Enterprise v3.22.1 bundled Istio Ambient Mesh in the Tigera Operator bringing the production hardened and one hundred percent upstream Istio images with sidecarless mTLS to the Calico stack.

As of Calico Open Source v3.32, the same capability is available in the open-source edition. If your platform team is running Istio in sidecar mode, or has given up on service mesh because of its complexity and resource usage, Istio’s ambient mode is worth a second look. In ambient mode there are no sidecars to wrangle on every upgrade, no per-pod CPU and memory overhead, and a much smaller surface to patch when the next CVE lands.

For the full story including architecture, migration path, and a sidecar-tax deep dive, read the Winter 2026 launch blog post.

What’s new in Calico Enterprise and Calico Cloud

KubeVirt Live Migration in Bridge Mode that is part of Calico Open Source v3.32 is also available in Calico Enterprise where it arrives as a tech preview in v3.23 EP2. For organizations evaluating KubeVirt as their landing spot for VMs, this is the release that makes Calico a supported production target.

Beyond KubeVirt, three Platform-exclusive capabilities help you achieve operational maturity at scale, keeping your policy estate clean, unifying management across cluster and non-cluster workloads, and running load-balancer maintenance without customer impact.

Last Evaluated Metrics, Now via API (Cloud and Enterprise)

As customers extend microsegmentation across Kubernetes, the policy set grows sometimes into the thousands for large enterprises. Workloads change, applications change, and the policies that were essential six months ago may not match traffic anymore. Unused policies don’t announce themselves, they lurk, no longer evaluating traffic, but still on the books, a security and compliance risk that violates the least-privileged posture you’ve spent years building towards.

The Winter 2026 release introduced the “Last Evaluated” metric to surface policies and rules that haven’t matched traffic within a configurable window. Spring 2026 adds API access. Platform teams can now query the metric programmatically and feed it into automated cleanup workflows, compliance reports, scheduled alerts, or command line utilities. The same data that supports a PCI DSS v4.1 audit conversation can now flow into a Prometheus alerting rule or a nightly cleanup-candidate report.

One thing worth being explicit about: the metric tells you whether a policy is evaluating traffic, not whether it should still exist. Customers still make the call about what’s genuinely unused, based on knowledge of the workloads. The API uncovers the candidates. The platform team makes the decision.

Key Benefits of Last Evaluated Metrics:

Automate Policy Hygiene: Pipe Last Evaluated data into Prometheus alerts, scheduled reports, or any other workflow you already run.
Generate Compliance Evidence on Demand: Show auditors that every active rule is in use, the proof PCI DSS v4.1 and similar standards require.
Troubleshoot From the CLI: Query last-evaluated state directly via terminal during an incident, no browser required.
Decommission Unused Policies Without Guesswork: Confidently clean up unused policies, not only to maintain that least-privileged posture but to reduce etcd memory pressure and shorten policy-engine evaluation time.

Scenario: Pruning a Microsegmentation Estate at Scale

The Situation:

A large financial-services platform team has been running Calico for several years. Their policy set has grown to several thousand policies accumulated from successive microsegmentation projects, decommissioned services, and one-off tickets. PCI DSS v4.1 audit is approaching, and the auditor wants evidence that every active rule is actually serving a purpose. Manually reviewing several thousand rules isn’t feasible, and the team can’t safely delete what they don’t understand.

The Calico Solution:

The platform team uses the Last Evaluated Metrics API to pull a list of policies and rules that haven’t matched traffic in the last 90 days. They route the output to a CSV, distribute it to the owning teams, and ask each team to confirm or contest each candidate. Within two weeks the policy set is several hundred rules smaller and the auditor gets the evidence trail directly from the metric output, not from a manual investigation.

Automate your least-privileged posture

Egress Gateway Layer 2 Advertisements

With Egress Gateway Layer 2 Advertisements to Calico 3.23 EP2 eliminates the need for cluster-specific egress IP pools and for BGP peering with ToR switches. You can now assign addresses from the hosts subnet to egress gateways, SNAT egress traffic to the gateway’s host node IP and forward packets using ARP. This means less reliance on coordinating with the network team, more efficient use of routable IP addresses and simplified firewall rules for reduced operational overhead.

Key Benefits of Egress Gateway Layer 2 Advertisements:

Reduce the Need for Coordination with the Network Team: Allocate IPs to new egress gateways without extensive intervention by the networking team significantly increasing deployment velocity.
Forward Packets Using ARP: Decrease operational overhead doing away with BGP session on top-of-rack switches.
Avoid Depleting Routable IPs in Large Environments: Configure a shared set of allow-listed IPs rather than a per-tenant pool preserving scarce routable IPs.
Maintain One Firewall Ruleset: Pod egress IPs come from the host’s own subnet, so the firewall team works with the same address space it already maintains for hosts and VMs making firewall configuration and ongoing maintenance much simpler.

Pod egress lives in the same address space your network team already maintains for hosts and VMs

Scenario: Cluster Scale-Up Without a Firewall Ticket

The Situation:

A financial services platform team exposes a set of cluster services to external partner systems through a corporate firewall. Pod egress traffic uses IPs from a cluster-managed pool that the network team registers in the firewall ruleset. Every time the platform team scales the cluster, the pool changes, the firewall ruleset needs updating, and a change-control ticket flows between the two teams. They meet monthly to reconcile drift.

The Calico Solution:

Egress Gateway Layer 2 Advertisements moves pod egress identity into the host’s own subnet. Pod traffic now exits the cluster using a uniquely identifiable IP address from the host’s routable subnet, which can be allowed by the network firewall. Cluster scale-ups stop triggering firewall changes. The reconciliation meeting comes off the calendar.

Policy Recommendations for VMs and Hosts

Calico’s policy recommendations engine has been a valuable tool in a platform engineers arsenal giving teams a head start authoring policies for Kubernetes pods. Until now, however, they could not take advantage of this productivity boost when it came to hosts running outside a cluster. A new VM or bare-metal workload meant manually combing through flow logs and hand-authoring policies which, at scale, often became a significant microsegmentation bottleneck. Policy Recommendations for VMs and Hosts extends the policy recommendation engine to non-cluster workloads. As of v3.23 EP2, Calico observes traffic to and from VMs and bare-metal hosts generating recommended starting policies just as it does for the workloads running in your cluster. The same review-and-apply process platform engineers use for pods now applies to every workload Calico manages.

Key Benefits of Policy Recommendations for VMs and Hosts:

Dispense with Hand-Rolling Policies for VMs and Hosts: Calico generates starting points for non-cluster workloads from observed traffic, the same way it does for pods.
Scale Microsegmentation Across the Whole Estate: Bring least-privilege policies to hundreds or thousands of non-cluster workloads without writing each one by hand.
Use One Authoring Workflow for Every Workload: Work with the same tooling and the same review pattern across pods, VMs, and bare-metal hosts.

Scenario: Microsegmenting a Thousand VMs Without a Thousand Authoring Tasks

The Situation:

A telco runs Kubernetes workloads for 5G edge services alongside thousands of VMs for legacy signaling systems. The platform team has automated policy recommendations for pods, but every new VM workload comes with a manual policy-authoring task. The team cannot keep pace with the VM side, so default policies on VMs trend toward permissive over time.

The Calico Solution:

With Policy Recommendations for VMs and Hosts, the team’s existing recommendation workflow now covers VMs and bare-metal workloads. Recommendations come in based on observed traffic. The team reviews and applies them at the same rate they already review and apply pod recommendations. Microsegmentation extends across the entire estate without doubling the authoring workload.

One review-any-apply workflow across pods, VMs and bare-metal hosts

Calico Load Balancer – Maintenance Mode (Enterprise Exclusive)

Choosing a software load balancer was already the right call for platform teams who wanted declarative service exposure and consistent-hash session affinity, capabilities Calico Load Balancer has delivered since v3.23 EP1.

With v3.23 EP2, the call gets easier. The fast, predictable failover that a pair of hardware load balancers in HA handles cleanly is now native to Calico’s software LB and ready to take over from that expensive 2018 LB you thought you had to replace. Calico Load Balancer now supports label-based node exclusion. Setting maglev.tigera.io/exclude=true on a node tells Calico Load Balancer to stop forwarding new connections to the backends the node hosts while keeping existing sessions flowing until they complete naturally. Prometheus metrics expose per-node active session counts so operators can watch them decline to zero before proceeding with the drain.

Key Benefits of Graceful Maglev Session Handling:

Patch Nodes During Business Hours: Take nodes out of load-balancer rotation for kernel patches, kubelet upgrades, or hardware work without scheduling around customer traffic.
Drain a Node With a Single Label: Set maglev.tigera.io/exclude=true on a node and Calico Load Balancer stops forwarding new connections to its backends, with no custom scripts or out-of-band coordination.
Drain Without Forcing Disconnects: Active sessions on the excluded node keep flowing until they complete naturally so maintenance doesn’t cut off in-flight work.
Know When It’s Safe to Drain: Prometheus metrics expose per-node session counts so operators can watch them decline to zero before proceeding with maintenance.

Scenario: Maintenance That Customers Never Notice

The Situation:

Scheduled maintenance on a node serving live customer traffic has always been a balancing act. Take the node out of rotation too early and customers with in-flight transactions get cut off mid-session. Wait too long and the maintenance window slips. Most teams have either accepted some level of session disruption or built bespoke tooling to coordinate their load balancer’s health checks with the drain workflow.

The Calico Solution:

The platform engineer labels the node with maglev.tigera.io/exclude=true. From that moment, Calico routes new connections to backends elsewhere in the cluster. Existing sessions on the excluded node keep flowing until they complete, so customers with in-flight transactions finish them naturally. The engineer watches per-node session counts in Prometheus, and when the count reaches zero, drains the node. The maintenance happens. The customers don’t notice.

Same fast, predictable failover as hardware load balancers but Kubernetes native

Get Started with Calico Spring 2026

The Spring 2026 release closes some critical Day 2 operations gaps unifying operations across Kubernetes pods and VMs, collapsing two operational worlds into one network, one policy model and one observability stack. It removes long-standing operational friction and clears the way for scaling infrastructure securely and efficiently helping teams take that next step towards Kubernetes operational maturity.

Environment	Action Required	Documentation Link
Calico Open Source	Upgrade to Calico v3.32	Calico Open Source release notes
Calico Enterprise	Upgrade to Enterprise v3.23 EP2	Upgrade Calico Enterprise documentation
Calico Cloud	Follow instructions to update connected clusters	Upgrade Calico Cloud instructions

To learn more about these new product capabilities and see them in action, schedule a demo.

The post What’s new in Calico: Spring 2026 Release appeared first on Tigera – Creator of Calico.