<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alister Baroi</title>
    <description>The latest articles on DEV Community by Alister Baroi (@alisterbaroi).</description>
    <link>https://dev.to/alisterbaroi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3793080%2Faa9f5766-bbc8-4978-b7ae-3a081475d824.jpg</url>
      <title>DEV Community: Alister Baroi</title>
      <link>https://dev.to/alisterbaroi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alisterbaroi"/>
    <language>en</language>
    <item>
      <title>Anthropic Mythos Broke Firefox: 271 zero-day vulnerabilities</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Thu, 23 Apr 2026 23:12:05 +0000</pubDate>
      <link>https://dev.to/alisterbaroi/anthropic-mythos-broke-firefox-271-zero-day-vulnerabilities-3p0</link>
      <guid>https://dev.to/alisterbaroi/anthropic-mythos-broke-firefox-271-zero-day-vulnerabilities-3p0</guid>
      <description>&lt;p&gt;&lt;strong&gt;271 zero-day vulnerabilities. One AI model. One Firefox release.&lt;/strong&gt; And that's just one of four stories worth your attention this fortnight.&lt;/p&gt;

&lt;p&gt;If you run engineering, security, or AI at your company, this article will give you a clear message: AI is no longer something your team &lt;em&gt;uses&lt;/em&gt;. It's something your team (and your attackers) &lt;em&gt;deploys&lt;/em&gt;. Here are the four moves that matter, and the numbers behind each.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Mythos found 271 zero-day vulnerabilities in Firefox 150
&lt;/h2&gt;

&lt;p&gt;On April 22, Mozilla shipped &lt;a href="https://www.firefox.com/en-US/firefox/150.0/releasenotes/" rel="noopener noreferrer"&gt;Firefox 150&lt;/a&gt; with patches for &lt;strong&gt;271 security vulnerabilities&lt;/strong&gt;, all identified by Anthropic's unreleased Mythos model during what Mozilla calls &lt;em&gt;its initial evaluation&lt;/em&gt;. For context: across all of 2025, Mozilla patched roughly &lt;strong&gt;73 high-severity Firefox bugs&lt;/strong&gt;. Mythos delivered almost &lt;strong&gt;4× that count&lt;/strong&gt; in one evaluation window.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mythos&lt;/strong&gt; is distributed under Anthropic's restricted &lt;a href="https://www.anthropic.com/glasswing" rel="noopener noreferrer"&gt;Project Glasswing&lt;/a&gt; programme: not a public model, and not available via API.&lt;/li&gt;
&lt;li&gt;Firefox 150's security advisory lists &lt;strong&gt;41&lt;/strong&gt; &lt;a href="https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures" rel="noopener noreferrer"&gt;CVE&lt;/a&gt;s; three of those CVEs are memory-safety roll-ups that bundle many of the 271 individual defects.&lt;/li&gt;
&lt;li&gt;The most serious finds were &lt;strong&gt;use-after-free bugs in the DOM and WebRTC&lt;/strong&gt;, the same bug class that has driven browser exploitation for two decades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mozilla's caveat&lt;/strong&gt; (worth repeating): Mythos did not find any category of bug that an elite human researcher could not have found. The gain is &lt;strong&gt;scale and speed&lt;/strong&gt;, not new capability.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"A gap between machine-discoverable and human-discoverable bugs favors the attacker, who can concentrate many months of costly human effort to find a single bug. Closing this gap erodes the attacker's long-term advantage by making all discoveries cheap."&lt;/em&gt; — &lt;a href="https://www.mozilla.org/" rel="noopener noreferrer"&gt;Mozilla&lt;/a&gt;, on the shift in attacker/defender economics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If Anthropic can hand Mozilla 271 real bugs in a single evaluation, assume your own vendors (and your adversaries) are running similar passes on your stack. The question to ask this quarter is no longer &lt;em&gt;"do we use AI in our security review?"&lt;/em&gt; — it is &lt;em&gt;"which of our vendors do, and what does our threat model look like if attackers scale this before we do?"&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Anthropic launched Claude Design
&lt;/h2&gt;

&lt;p&gt;On April 17, Anthropic released &lt;a href="https://www.anthropic.com/news/claude-design-anthropic-labs" rel="noopener noreferrer"&gt;Claude Design&lt;/a&gt;, a new Anthropic Labs product built on &lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Claude Opus 4.7&lt;/a&gt;. It turns Claude into a design tool that produces real deliverables: prototypes, slide decks, one-pagers, marketing collateral.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads your &lt;strong&gt;codebase and existing design files&lt;/strong&gt; to apply brand rules automatically.&lt;/li&gt;
&lt;li&gt;Accepts &lt;strong&gt;5+ input formats:&lt;/strong&gt; text prompts, images, DOCX, PPTX, XLSX.&lt;/li&gt;
&lt;li&gt;Exports to &lt;strong&gt;Canva&lt;/strong&gt;, &lt;strong&gt;PDF&lt;/strong&gt;, &lt;strong&gt;PPTX&lt;/strong&gt;, &lt;strong&gt;HTML&lt;/strong&gt;, or a shareable internal URL.&lt;/li&gt;
&lt;li&gt;Hands off to &lt;strong&gt;Claude Code&lt;/strong&gt; when a prototype needs real implementation.&lt;/li&gt;
&lt;li&gt;Available in research preview across &lt;strong&gt;4 subscription tiers&lt;/strong&gt;: Pro, Max, Team, Enterprise.&lt;/li&gt;
&lt;li&gt;Datadog's quantified claim: prototyping that took &lt;strong&gt;one week of back-and-forth&lt;/strong&gt; now happens in &lt;strong&gt;one conversation&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is Anthropic stepping out of "model behind an API" and into "end-user product", competing directly with &lt;strong&gt;Figma&lt;/strong&gt;, &lt;strong&gt;Canva&lt;/strong&gt;, and the slide-building half of &lt;strong&gt;Microsoft 365&lt;/strong&gt;. If your product organisation still treats model vendors as neutral infrastructure, that assumption has a shorter shelf life than your next budget cycle. The vendor now competes with some of your tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Google open-sourced DESIGN.md
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://labs.google/" rel="noopener noreferrer"&gt;Google Labs&lt;/a&gt; released a draft open-source specification called &lt;a href="https://blog.google/innovation-and-ai/models-and-research/google-labs/stitch-design-md/" rel="noopener noreferrer"&gt;DESIGN.md&lt;/a&gt;, a format that describes design systems in a way AI agents can read, reason about, and validate against. It shipped alongside &lt;a href="https://stitch.withgoogle.com/" rel="noopener noreferrer"&gt;Stitch&lt;/a&gt; (Google's AI UI tool), but the format itself is &lt;strong&gt;platform-agnostic&lt;/strong&gt; and hosted on &lt;a href="https://github.com/google-labs-code/design.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encodes design intent so AI agents stop guessing: &lt;em&gt;"agents can know exactly what a color is for"&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Includes built-in &lt;a href="https://www.w3.org/TR/WCAG21/" rel="noopener noreferrer"&gt;WCAG&lt;/a&gt; &lt;strong&gt;accessibility validation&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Portable across any tool or platform, not locked to Stitch.&lt;/li&gt;
&lt;li&gt;Released as a draft spec, open to contribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Watch the &lt;strong&gt;format&lt;/strong&gt;, not the tool. Markdown files that AI agents read for persistent context — &lt;em&gt;CLAUDE.md, AGENTS.md, README.md&lt;/em&gt;, and now &lt;em&gt;DESIGN.md&lt;/em&gt; — are becoming the lingua franca of AI-native workflows. The standard here is being set in public, right now. Whichever spec wins becomes the default your engineering teams (and their AI copilots) work against for the next decade. API.md, SECURITY.md, and ONBOARDING.md are the obvious next chapters. If you have a design system or a platform team, this is a draft you want an opinion on.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. OpenAI is quietly building "Hermes" — always-on agents inside ChatGPT
&lt;/h2&gt;

&lt;p&gt;Leaked internal screenshots, surfaced by &lt;a href="https://www.testingcatalog.com/openai-develops-platform-for-always-on-agents-on-chatgpt/" rel="noopener noreferrer"&gt;TestingCatalog&lt;/a&gt; between &lt;strong&gt;April 6–21&lt;/strong&gt;, show OpenAI actively developing a platform codenamed &lt;strong&gt;Hermes&lt;/strong&gt;. It adds persistent, &lt;strong&gt;24/7&lt;/strong&gt; agents to ChatGPT — agents that run even when you are not at the keyboard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom workflows&lt;/strong&gt; and skill assembly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task scheduling&lt;/strong&gt; and event-triggered actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External messaging connectors:&lt;/strong&gt; agents can reach users outside ChatGPT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role-based templates:&lt;/strong&gt; leaked screenshots show CTO and CPO archetypes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent orchestration&lt;/strong&gt;, integrated with OpenAI's existing Workflows builder.&lt;/li&gt;
&lt;li&gt;Status: internal beta. No release date confirmed. &lt;strong&gt;Unofficial:&lt;/strong&gt; treat as leak, not announcement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signal for engineering leaders:&lt;/strong&gt; If Hermes ships in the form shown, ChatGPT stops being a chat interface and becomes a &lt;strong&gt;runtime for autonomous systems&lt;/strong&gt;, a direct competitor to &lt;strong&gt;Salesforce Agentforce&lt;/strong&gt;, &lt;strong&gt;Microsoft Copilot Studio&lt;/strong&gt;, and every agent startup built &lt;em&gt;on top of&lt;/em&gt; the OpenAI API. Those startups are then competing with their own platform provider, using agent patterns their provider can see in aggregate across hundreds of millions of users. If your 2026 roadmap includes an AI agent strategy built on vendor APIs, this is the risk line item you want in your Q3 review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Thread
&lt;/h2&gt;

&lt;p&gt;Four announcements, two weeks, one pattern. AI this fortnight was not about bigger models or cleaner benchmarks. It was about AI &lt;strong&gt;doing the work&lt;/strong&gt; — finding real zero-days in shipped software, producing design artifacts that replace a week of iteration, standardizing how agents read intent, and (in OpenAI's case) running as always-on infrastructure your teams have not yet budgeted for.&lt;/p&gt;

&lt;p&gt;The message for leaders is simple: &lt;em&gt;the operational reality of AI is moving faster than most roadmaps were written to handle&lt;/em&gt;. &lt;/p&gt;

</description>
      <category>anthropic</category>
      <category>hermes</category>
      <category>mozilla</category>
      <category>firefox</category>
    </item>
    <item>
      <title>KubeVirt Networking: How to Preserve VM IP Addresses During Migration</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Tue, 21 Apr 2026 20:55:58 +0000</pubDate>
      <link>https://dev.to/tigeraio/kubevirt-networking-how-to-preserve-vm-ip-addresses-during-migration-1fe9</link>
      <guid>https://dev.to/tigeraio/kubevirt-networking-how-to-preserve-vm-ip-addresses-during-migration-1fe9</guid>
      <description>&lt;p&gt;Organisations are re-evaluating their VM infrastructure. The economics have shifted, the tooling has matured, and the case for running two separate platforms, one for containers, one for VMs, is getting harder to justify. Platform teams that spent years managing hypervisor infrastructure are being asked to consolidate, and most are landing on the same answer: Kubernetes.&lt;/p&gt;

&lt;p&gt;KubeVirt makes running VMs on Kubernetes possible. But &lt;a href="https://www.tigera.io/blog/deep-dive/the-power-of-kubevirt-and-calico/" rel="noopener noreferrer"&gt;KubeVirt networking&lt;/a&gt; – what happens to a VM’s IP address, VLAN, and security posture when it lands in a cluster – is where most migration plans hit a wall. The reasons go beyond cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Most enterprises already run Kubernetes.&lt;/strong&gt; Containers are already there. Adding VMs to the same platform consolidates tooling, lifecycle management, networking models, and security policy into a single operational model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two platforms means double the overhead.&lt;/strong&gt; Separate infrastructure means separate upgrade cycles, separate monitoring, separate network configuration, and separate on-call runbooks. Platform consolidation has direct operational value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes is mature enough.&lt;/strong&gt; KubeVirt has reached the point where it’s a viable production choice for enterprise VM workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The decision to migrate is being made. The question is &lt;strong&gt;how to do it without causing chaos.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing KubeVirt
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kubevirt.io/" rel="noopener noreferrer"&gt;KubeVirt&lt;/a&gt; extends the Kubernetes API with new custom resource types: &lt;code&gt;VirtualMachine&lt;/code&gt; and &lt;code&gt;VirtualMachineInstance&lt;/code&gt;. These make VMs first-class Kubernetes objects — scheduled, managed, and observable through the same tools and APIs as containers.&lt;/p&gt;

&lt;p&gt;A VM running in KubeVirt runs inside a &lt;code&gt;virt-launcher&lt;/code&gt; pod. Kubernetes schedules that pod to a node with available resources, the same way it schedules any other workload. The VM gets CPU and memory from the node. It doesn’t know it moved.&lt;/p&gt;

&lt;p&gt;That’s the point: from the VM’s perspective, KubeVirt is invisible. The operating system keeps running. The application keeps running.&lt;/p&gt;
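
&lt;p&gt;For readers newer to KubeVirt, a minimal &lt;code&gt;VirtualMachine&lt;/code&gt; manifest looks roughly like the sketch below. The name, container disk image, and sizing are illustrative placeholders, not taken from any migration described here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;EOF
# Illustrative only: a minimal KubeVirt VirtualMachine definition.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm                  # hypothetical name
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest   # example image
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;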

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrh4hrko84vggixyj8dz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrh4hrko84vggixyj8dz.png" width="676" height="254"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;KubeVirt virt-launcher pods in Kubernetes&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The network is a different story
&lt;/h2&gt;

&lt;p&gt;When you migrate a VM, three things have to follow: compute, storage, and network. Compute and storage are properties of the VM itself — self-contained. KubeVirt handles them by giving the VM a new host and a new storage backend. The VM doesn’t notice.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dependency&lt;/th&gt;
&lt;th&gt;What KubeVirt Does&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VM runs in a virt-launcher pod. Kubernetes schedules it.&lt;/td&gt;
&lt;td&gt;Solved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Disk images mapped to Persistent Volumes via migration tools.&lt;/td&gt;
&lt;td&gt;Solved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VM gets a new IP from the Kubernetes pod CIDR.&lt;/td&gt;
&lt;td&gt;Not Solved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;em&gt;Dependencies of VM migrations&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The network is different. The network isn’t a property of the VM. &lt;strong&gt;It’s a property of the VM’s relationship to everything else in the infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A VM’s compute dependency is between the VM and its host. A VM’s storage dependency is between the VM and a storage backend. But a VM’s network dependency is between the VM and every other system that knows how to reach it.&lt;/p&gt;

&lt;p&gt;That distinction is why networking is where VM migrations stall. This isn’t theoretical. KubeVirt’s own issue tracker documents the problem directly: a user &lt;a href="https://github.com/kubevirt/kubevirt/issues/14320" rel="noopener noreferrer"&gt;reported their VM’s IP changing after live migration&lt;/a&gt;, and a project maintainer confirmed: “Sticky IPs is not implemented.” The network identity doesn’t follow the VM by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdy44194kw5fkekzay7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdy44194kw5fkekzay7s.png" width="800" height="584"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Lift-and-Shift VMs to Kubernetes with Calico L2 Bridge Networks&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why default KubeVirt networking breaks VM migrations
&lt;/h2&gt;

&lt;p&gt;When a VM lands in Kubernetes using default pod networking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It receives a &lt;strong&gt;new IP address&lt;/strong&gt; from the cluster’s pod CIDR, a range that exists only inside the cluster.&lt;/li&gt;
&lt;li&gt;The original &lt;strong&gt;VLAN doesn’t exist&lt;/strong&gt; inside the cluster. Kubernetes has no native VLAN concept in default networking.&lt;/li&gt;
&lt;li&gt;Pod IPs are &lt;strong&gt;only meaningful inside the cluster&lt;/strong&gt;. The upstream network has no direct visibility into them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the perspective of every system that previously knew the VM by its address, the VM has disappeared. Something with an unfamiliar IP has appeared inside a cluster that the upstream infrastructure can’t see into.&lt;/p&gt;

&lt;p&gt;A VM’s IP address accumulates dependencies over time. By the time you’re migrating it, that IP is embedded in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firewall rules — &lt;/strong&gt; security teams wrote rules allowing or denying traffic to that specific address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS records — &lt;/strong&gt; the hostname resolves to that IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DHCP configuration — &lt;/strong&gt; the IP is reserved for that VM’s MAC address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and alerting —&lt;/strong&gt;  observability tools are configured to watch that address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancer backends — &lt;/strong&gt; upstream load balancers route traffic to that IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application configuration files — &lt;/strong&gt; other services have that IP hardcoded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance and audit documentation —&lt;/strong&gt;  security posture records reference that IP in that VLAN.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VLANs add another dimension. In enterprise environments, VLANs aren’t just a way to segment traffic, they’re security boundaries, designed and owned by the security team. Firewall rules are built around VLAN membership. Compliance frameworks reference VLAN placement. When the VM moves to Kubernetes with default networking, that VLAN disappears. The security boundary is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;None of this travels with the VM automatically&lt;/strong&gt;. And every broken dependency requires a different team to fix it.&lt;/p&gt;

&lt;p&gt;You can see this directly. Running &lt;code&gt;kubectl exec&lt;/code&gt; into the virt-launcher pod of a migrated VM shows the interfaces KubeVirt creates with default pod networking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;2: eth0@if9: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1450
&lt;span class="go"&gt;inet 10.60.141.196/32 scope global eth0
&lt;/span&gt;&lt;span class="gp"&gt;3: k6t-eth0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1450
&lt;span class="go"&gt;inet 10.0.2.1/24 scope global k6t-eth0
&lt;/span&gt;&lt;span class="gp"&gt;4: tap0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1450 master k6t-eth0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;eth0 is a Calico-assigned pod CIDR address — meaningful only inside the cluster. k6t-eth0 is KubeVirt’s internal masquerade bridge. tap0 connects to the VM’s virtual NIC. The VM’s original IP is gone. The upstream network sees 10.60.141.196, not the address any firewall rule, DNS record, or application config was written for.&lt;/em&gt;&lt;/p&gt;
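
&lt;p&gt;To reproduce a view like this yourself, a command along the following lines works against the migrated VM’s launcher pod. The namespace and pod name below are hypothetical; KubeVirt generates the &lt;code&gt;virt-launcher&lt;/code&gt; pod name per VM:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;# Find the launcher pod for the VM, then list its network interfaces.
kubectl get pods -n vm-workloads | grep virt-launcher
kubectl exec -n vm-workloads virt-launcher-demo-vm-abcde -- ip addr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;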

&lt;h2&gt;
  
  
  A lift-and-shift becomes a multi-team project
&lt;/h2&gt;

&lt;p&gt;Here’s what was planned: the platform team moves the VM. One team. The migration is invisible to the rest of the business.&lt;/p&gt;

&lt;p&gt;Here’s what actually happens with default pod networking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The IP changes&lt;/strong&gt;. The &lt;strong&gt;network team&lt;/strong&gt; needs to rewrite firewall rules and update DNS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The VLAN disappears&lt;/strong&gt;. The &lt;strong&gt;security team&lt;/strong&gt; needs to review the new network placement and approve it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application config breaks&lt;/strong&gt;. The &lt;strong&gt;application team&lt;/strong&gt; needs to update config files and hardcoded references&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of these requires sign-offs, tickets, and coordination.&lt;/p&gt;

&lt;p&gt;A migration budgeted as a lift-and-shift gets delivered as a network redesign. &lt;strong&gt;Per VM&lt;/strong&gt;. At scale, the coordination cost makes migration impractical.&lt;/p&gt;

&lt;p&gt;This is where VM migration to Kubernetes stalls, not because the technology doesn’t work, but because the organisational cost exceeds what anyone planned for or funded.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to preserve VM IP addresses and VLANs in Kubernetes
&lt;/h2&gt;

&lt;p&gt;Think about what the problem really is. The VM had a home on the network. A specific IP, a specific VLAN, a specific place in the security model. When it moved to Kubernetes, that home disappeared. Default pod networking gave it a new address in a new network that nothing outside the cluster knows about.&lt;/p&gt;

&lt;p&gt;Calico L2 Bridge Networks solve this by doing the opposite. Instead of putting the VM on the Kubernetes pod network, they bring the VM’s original network into Kubernetes: the physical Layer 2 segment the VM lived on is extended directly into the cluster via a bridge on the node. The VM connects to that bridge through a secondary interface, and the &lt;a href="http://tigera.io/blog/lift-and-shift-vms-to-kubernetes-with-calico-l2-bridge-networks/" rel="noopener noreferrer"&gt;VM preserves its original IP address&lt;/a&gt;, the same VLAN, and the same MAC address it had before the migration. Its network identity survives the move unchanged.&lt;/p&gt;

&lt;p&gt;Nothing on the outside knows anything has changed. The firewall still talks to the same IP. DNS still resolves to the right place. The monitoring dashboard still shows the right host. The application that had the IP hardcoded still connects. The security team’s VLAN boundary still exists — it just now exists inside Kubernetes too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpagqr6cfw811q39ztv4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpagqr6cfw811q39ztv4g.png" width="800" height="574"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;L2 Bridge Mode with Calico by Tigera&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can see the difference at the interface level. With Calico L2 Bridge, that same &lt;code&gt;virt-launcher&lt;/code&gt; pod now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;2: eth0@if9: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1450
&lt;span class="go"&gt;   inet 10.60.141.196/32 scope global eth0
&lt;/span&gt;&lt;span class="gp"&gt;3: k6t-eth0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1450
&lt;span class="go"&gt;   inet 10.0.2.1/24 scope global k6t-eth0
&lt;/span&gt;&lt;span class="gp"&gt;4: tap0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1450 master k6t-eth0
&lt;span class="gp"&gt;5: net1: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mtu 1500
&lt;span class="go"&gt;   link/ether 52:54:00:3a:7f:21 brd ff:ff:ff:ff:ff:ff
   inet 10.10.5.42/24 brd 10.10.5.255 scope global net1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;net1&lt;/code&gt; is the secondary interface connected to the L2 bridge that Calico manages on the node. That’s the VM’s original IP, &lt;code&gt;10.10.5.42&lt;/code&gt;, on its original subnet, with its original MAC address. The pod-side interfaces are still there (KubeVirt still needs them), but the VM’s actual network identity is preserved on &lt;code&gt;net1&lt;/code&gt;. That’s the interface the rest of your infrastructure talks to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a secondary interface and not the primary?
&lt;/h2&gt;

&lt;p&gt;KubeVirt manages the VM’s primary network interface through the &lt;code&gt;virt-launcher&lt;/code&gt; pod. That primary interface has two modes: &lt;strong&gt;masquerade&lt;/strong&gt; and &lt;strong&gt;bridge&lt;/strong&gt;. &lt;strong&gt;Masquerade&lt;/strong&gt; NATs all VM traffic through the pod’s IP. The VM is hidden behind the pod address. &lt;strong&gt;Bridge&lt;/strong&gt; mode connects the VM to the pod network bridge. Closer, but still the pod network, not your VLAN.&lt;/p&gt;

&lt;p&gt;Neither mode has a way to extend an external VLAN directly to the VM. They’re designed for pod networking, not for preserving legacy network identity.&lt;/p&gt;

&lt;p&gt;The secondary interface is what makes this work. Calico attaches an additional interface to the VM and that interface connects to the bridge Calico created on the node, which connects to the trunk carrying your VLAN from the physical switch. The VM’s traffic on that interface goes directly to the right network segment without any translation or tunnelling.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Calico sets it up
&lt;/h2&gt;

&lt;p&gt;The setup is declarative. You define what you want, Calico handles the plumbing.&lt;/p&gt;

&lt;p&gt;You create a &lt;code&gt;network&lt;/code&gt; resource in Kubernetes that tells Calico which VLAN to bridge and how to map it. Calico reads that and creates the bridge on the node automatically, attaches the trunk interface, and starts tracking the VM’s IP. A &lt;code&gt;NetworkAttachmentDefinition&lt;/code&gt; tells KubeVirt to attach the secondary interface at boot. The &lt;code&gt;VirtualMachine&lt;/code&gt; spec references the secondary network, and when the VM starts, &lt;code&gt;net1&lt;/code&gt; appears with the right IP.&lt;/p&gt;
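
&lt;p&gt;The Calico-specific &lt;code&gt;network&lt;/code&gt; resource and the &lt;code&gt;NetworkAttachmentDefinition&lt;/code&gt; are created per Tigera’s documentation for your Calico version, so they are omitted here. As a rough sketch of the KubeVirt side of the wiring, trimmed to the networking-relevant fields, a VM that keeps its primary pod interface and adds a bridged secondary interface looks something like this (the VM name and the network name &lt;code&gt;vlan105-bridge&lt;/code&gt; are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;EOF
# Illustrative snippet only: disks, volumes, and sizing omitted.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: migrated-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: default
              masquerade: {}        # primary interface on the pod network
            - name: legacy-vlan
              bridge: {}            # secondary interface preserved on the VLAN
      networks:
        - name: default
          pod: {}
        - name: legacy-vlan
          multus:
            networkName: vlan105-bridge   # NetworkAttachmentDefinition name
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;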

&lt;p&gt;Migration tools like Forklift (for OpenShift Virtualisation) handle the mapping of existing VM interfaces to the cluster definitions and register the VM’s IP with Calico before migration. From that point, Calico owns the IP, tracking it, keeping routing state correct, and following the VM if it moves between nodes.&lt;/p&gt;

&lt;p&gt;Multiple VLANs can run through the same trunk-backed bridge. You don’t need separate infrastructure per VLAN, the same bridge handles them all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you gain after the migration
&lt;/h2&gt;

&lt;p&gt;Getting the VM into Kubernetes without breaking anything is the primary goal. But once it’s there, a few things become available that weren’t possible in the hypervisor environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network visibility
&lt;/h3&gt;

&lt;p&gt;In a traditional hypervisor setup, getting visibility into what a VM is actually doing on the network usually means deploying a separate agent, a network tap, or a dedicated monitoring tool per host. With Calico, that visibility comes built into the platform: traffic flow data, communication patterns, and network behaviour for VM interfaces, without anything extra to install or manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security policy you can actually version control
&lt;/h3&gt;

&lt;p&gt;The firewall rules that protected this VM before migration were probably sitting in a security team’s ticketing system, applied manually to a physical or virtual firewall. They worked, but they weren’t portable, they weren’t reviewable in a pull request, and they weren’t easy to audit.&lt;/p&gt;

&lt;p&gt;With Calico, you can express the same security posture as &lt;a href="https://www.tigera.io/learn/guides/kubernetes-security/kubernetes-network-policy/" rel="noopener noreferrer"&gt;Kubernetes-native network policy&lt;/a&gt;. Labels, selectors, declarative YAML. You don’t have to do this immediately as part of the migration. The VLAN boundary still exists, the existing firewall rules still apply. But when the security team is ready to modernise the policy model, the tooling is already there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Live migration that doesn’t touch the network
&lt;/h3&gt;

&lt;p&gt;Once a VM is running in Kubernetes, it can move between nodes for patching, rebalancing, hardware failures, and the network configuration moves with it. Calico tracks the IP and updates routing state automatically. From the outside, nothing changes. The VM is just on a different node now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making VM migration to Kubernetes practical
&lt;/h2&gt;

&lt;p&gt;Migration projects fail when the platform team scopes a job as “move the VM” and it turns into “rebuild the network.” That scope creep isn’t a technical failure, it’s what happens when you use a networking model designed for stateless containers to move workloads that were designed around stable, long-lived network identities.&lt;/p&gt;

&lt;p&gt;Calico L2 Bridge Networks solve the right problem: keep the network identity intact during the move, let the migration stay within the platform team’s remit, and leave modernisation for when it’s actually planned and funded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move now. Modernise later. On your own timeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Watch our walkthrough to learn more: &lt;a href="http://youtube.com/watch?v=gxpm47mGKPc" rel="noopener noreferrer"&gt;Calico L2 Bridge Networking for Virtual Machines&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/kubevirt-networking-how-to-preserve-vm-ip-addresses-during-migration/" rel="noopener noreferrer"&gt;KubeVirt Networking: How to Preserve VM IP Addresses During Migration&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera – Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalblog</category>
      <category>vmmigration</category>
      <category>howto</category>
    </item>
    <item>
      <title>Your AI Agents Are Autonomous. But Are They Accountable?</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:39:24 +0000</pubDate>
      <link>https://dev.to/tigeraio/your-ai-agents-are-autonomous-but-are-they-accountable-4pja</link>
      <guid>https://dev.to/tigeraio/your-ai-agents-are-autonomous-but-are-they-accountable-4pja</guid>
      <description>&lt;p&gt;&lt;em&gt;Why accountability, not capability, is the real bottleneck for enterprise agentic AI, and what security leaders need to do about it before regulators force the issue.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every enterprise is building AI agents. Marketing has one summarizing campaign performance. Engineering has one triaging incidents. Customer support has one resolving tickets. Finance has one processing invoices. And increasingly, those agents are talking to each other: calling tools, accessing databases, delegating tasks across complex multi-hop chains.&lt;/p&gt;

&lt;p&gt;But here’s the question nobody wants to hear at 3 a.m. when something goes wrong: &lt;em&gt;who authorized that action, what policy permitted it, and what’s the full chain of events?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For most enterprises, the honest answer is: nobody knows. That’s not a governance problem — it’s an &lt;a href="https://www.tigera.io/blog/beyond-the-prompt-ai-agent-design-patterns-and-the-new-governance-gap/" rel="noopener noreferrer"&gt;AI agent accountability&lt;/a&gt; crisis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents Are Scaling Faster Than Governance
&lt;/h2&gt;

&lt;p&gt;The data paints a stark picture. McKinsey research found that &lt;a href="https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/tech-forward/state-of-ai-trust-in-2026-shifting-to-the-agentic-era" rel="noopener noreferrer"&gt;80% of organizations have already encountered risky behavior from AI agents&lt;/a&gt;. These actions were unintended, unauthorized, or outside acceptable guardrails. Yet only about one-third of organizations report meaningful governance maturity. Gartner predicts that over &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;40% of agentic AI projects will be canceled by the end of 2027&lt;/a&gt; due to escalating costs, unclear business value, or inadequate risk controls.&lt;/p&gt;

&lt;p&gt;This isn’t a future problem. This is the mainstream enterprise experience with agentic AI right now. And the pattern should feel familiar. A decade ago, enterprises faced “shadow IT,” where employees adopting cloud services without IT approval created ungoverned sprawl that took years to bring under control. Today, agentic architectures risk creating a new back door for “shadow AI,” and the stakes are higher. Unlike cloud services, agents don’t just store data; they make decisions, call APIs, access databases, and propagate those actions across other agents in a chain that nobody can trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Regulatory Clock
&lt;/h2&gt;

&lt;p&gt;Compliance deadlines on both sides of the Atlantic are months away. The EU AI Act’s main provisions take effect in August 2026, requiring action logging, transparency, and human oversight for high-risk AI systems. In the US, the Colorado AI Act, currently the leading regulation, takes effect in June 2026, mandating risk management programs and impact assessments for high-risk AI. And Colorado isn’t the only state: California, New York, Utah, and Texas have already enacted AI governance laws, and there are 80+ AI governance bills under consideration in the current US Congress.&lt;/p&gt;

&lt;p&gt;Two-thirds of industry leaders believe &lt;a href="https://www.isaca.org/resources/news-and-trends/industry-news/2025/the-looming-authorization-crisis-why-traditional-iam-fails-agentic-ai" rel="noopener noreferrer"&gt;formal agent accountability frameworks will become mandatory within the next two years&lt;/a&gt;. The question isn’t whether these requirements are coming. It’s whether your organization will be ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Pillars for Agent Accountability
&lt;/h2&gt;

&lt;p&gt;Not all “governance” is created equal. Many enterprises believe they have agent governance because they have network policies or an API gateway. But governance without accountability is security theater; it might prevent some bad outcomes, but it can’t prove why good outcomes were permitted, trace what happened when something goes wrong, or satisfy an auditor asking for evidence.&lt;/p&gt;

&lt;p&gt;True agent accountability requires five distinct capabilities working together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Traceability —&lt;/strong&gt; Can you trace what happened, end to end? When Agent A calls Agent B, which calls Tool C, which accesses Database D, can you reconstruct the entire chain with timestamps and outcomes at every hop? Without traceability, incident response is guesswork.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization provenance —&lt;/strong&gt; Can you prove why it was permitted? Not just “Agent A was allowed to call Agent B,” but “Agent A was allowed to call Agent B because Policy X grants agents with capability Y access to agents with risk-level Z.” This is the difference between a lock on the door and a sign-in sheet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity and ownership —&lt;/strong&gt; Who owns this agent, and who is responsible when it acts? Every agent needs a verified identity and a clear human owner. Without it, accountability diffuses across components, and diffused accountability is no accountability at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy-based governance at scale —&lt;/strong&gt; Does your security model survive agent #101? With 10 agents, you can manage permissions by hand. With 100, you can’t. Scalable governance requires declarative, attribute-based policies that grow with the network, not against it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human oversight and intervention —&lt;/strong&gt; Can a human review, approve, or override? Effective oversight means visibility into what agents are doing, the ability to review interactions after the fact, and the power to modify policies or revoke access in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’re missing any one of these pillars, you have a gap that will surface during your next incident, audit, or regulatory review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Existing Approaches Can’t Deliver AI Agent Accountability
&lt;/h2&gt;

&lt;p&gt;Enterprises aren’t starting from zero; most have invested in network policies, API gateways, RBAC, and protocols like MCP and A2A. The problem isn’t a lack of tools. It’s that these tools were designed for model outputs (a world where services are deterministic, communication patterns are predictable, and humans make all the decisions), not autonomous actions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/learn/guides/kubernetes-security/kubernetes-network-policy/" rel="noopener noreferrer"&gt;Network policies&lt;/a&gt; operate at the wrong abstraction level for agent accountability. They can say “pods in namespace A can reach pods in namespace B,” but they can’t say “Agent A with risk-level=low can only call agents with risk-level=low.” They have no concept of agent identity, capabilities, or policy attributes, and they produce no audit trail.&lt;/p&gt;
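
&lt;p&gt;To make the abstraction gap concrete, here is a sketch of roughly the most a standard Kubernetes NetworkPolicy can say about two communicating agents (namespace and label names are hypothetical). Nothing in it knows what an “agent”, a “capability”, or a “risk level” is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;EOF
# Illustrative only: allows pods in namespace "agents-a" to reach pods
# labelled app=agent-b in namespace "agents-b". The policy reasons about
# namespaces, pods, and ports; it carries no agent identity, capability,
# or policy attributes, and it produces no audit trail.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-agents-a
  namespace: agents-b
spec:
  podSelector:
    matchLabels:
      app: agent-b
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: agents-a
      ports:
        - protocol: TCP
          port: 8080
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;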

&lt;p&gt;API gateways handle north-south traffic but don’t understand the east-west, multi-hop nature of agent-to-agent communication. MCP and A2A solve the &lt;em&gt;how&lt;/em&gt; of agent communication, but explicitly assume someone else handles the &lt;em&gt;who&lt;/em&gt; and the &lt;em&gt;why&lt;/em&gt;. RBAC works at small scale but can’t express the nuanced, attribute-based policies that agent governance requires.&lt;/p&gt;

&lt;p&gt;The industry has solved agent communication and agent infrastructure. What’s missing is the accountability layer — the control plane that answers three questions for every agent interaction: Who authorized this? What policy permitted it? What’s the full record?&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Governance Gap Is Growing
&lt;/h2&gt;

&lt;p&gt;The enterprises that thrive in the agentic era won’t be the ones that deploy the most agents. They’ll be the ones that can prove their agents are operating within policy, trace every interaction end to end, and answer the question: &lt;em&gt;who’s accountable when the agent acts?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We wrote a strategic guide to help you get there.&lt;/strong&gt; Our whitepaper, &lt;em&gt;Accountable AI Agents: A Strategic Guide for AI &amp;amp; Security Leaders Governing Autonomous AI at Scale&lt;/em&gt;, breaks down the full framework — the five pillars of agent accountability, why existing approaches leave gaps, and the architectural principles your governance platform needs to deliver. It also provides the solution, the accountability maturity model, which guides how to fix these security and accountability gaps. No product demos, no fluff. Just the blueprint your leadership team needs before the next incident or regulation forces your hand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://info.tigera.io/rs/805-GFH-732/images/Whitepaper_Accountability_for_AI_Agents.pdf?version=0" rel="noopener noreferrer"&gt;Get the strategic guide for accountable AI agents →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/your-ai-agents-are-autonomous-but-are-they-accountable/" rel="noopener noreferrer"&gt;Your AI Agents Are Autonomous. But Are They Accountable?&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera – Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>featuredblog</category>
      <category>technicalblog</category>
      <category>aiagentsecurity</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>Deployed Is Not the Same as Ready: How Mature Is Your Kubernetes Environment?</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Thu, 16 Apr 2026 22:00:25 +0000</pubDate>
      <link>https://dev.to/tigeraio/deployed-is-not-the-same-as-ready-how-mature-is-your-kubernetes-environment-317h</link>
      <guid>https://dev.to/tigeraio/deployed-is-not-the-same-as-ready-how-mature-is-your-kubernetes-environment-317h</guid>
      <description>&lt;p&gt;Kubernetes adoption is no longer the challenge it once was. More than 82% of enterprises run containers in production, most of them on multiple Kubernetes clusters. Adoption, however, does not mean operational maturity. These are two very different things. It is one thing to deploy workloads to a cluster or two and quite another to do it securely, efficiently and at scale.&lt;/p&gt;

&lt;p&gt;This distinction matters because the gap between adoption and &lt;a href="https://www.tigera.io/lp/ebook-building-resilient-multi-cluster-kubernetes/" rel="noopener noreferrer"&gt;Kubernetes operational maturity&lt;/a&gt; is where risk accumulates. Operationally mature organizations ship faster, recover from incidents in minutes instead of hours and consistently pass compliance audits. They spend less time dealing with outages and more time delivering new services to their customers.&lt;/p&gt;

&lt;p&gt;So what separates maturity from adoption? It comes down to a handful of foundational capabilities that, when done well, result in measurable business impact. Operational maturity — the ability to run Kubernetes workloads securely, efficiently, and at scale, with consistent policy enforcement, cross-cluster observability, and automated incident recovery — is not a destination; it is a continuous process of strengthening the architectural pillars that keep your Kubernetes environment production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does operational maturity look like?
&lt;/h2&gt;

&lt;p&gt;Operational maturity spans several interconnected areas from &lt;a href="https://www.tigera.io/learn/guides/kubernetes-security-best-practices/" rel="noopener noreferrer"&gt;Kubernetes security best practices&lt;/a&gt; to observability and multi-cluster connectivity that, taken together, determine how resilient, secure, and observable your Kubernetes environment truly is. One practical way to measure this is to walk through the capabilities your environment either has or does not have yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhk9f58pofqvo44c5txa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhk9f58pofqvo44c5txa.png" width="800" height="515"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A running vs an operationally mature Kubernetes environment&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you effectively isolate workloads from each other?
&lt;/h3&gt;

&lt;p&gt;The flat network default, which allows pods to be created, destroyed, and moved on the fly (a core Kubernetes capability), also creates a wide-open door for lateral movement if a workload is compromised.&lt;/p&gt;

&lt;p&gt;A tiered policy model addresses this by organizing &lt;a href="http://tigera.io/learn/guides/kubernetes-security/kubernetes-network-policy/" rel="noopener noreferrer"&gt;network policies&lt;/a&gt; into layers of precedence, each owned by a different team. Security teams define high-priority guardrails—for example, blocking traffic to malicious destinations, enforcing tenant isolation—while platform teams secure infrastructure components and developers write fine-grained rules for their own applications. This separation of duties eliminates policy sprawl and ensures that a developer-created rule can never accidentally override a critical security baseline.&lt;/p&gt;
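
&lt;p&gt;As a rough sketch, the layering might look like this in Calico’s tiered policy model. The tier name, order values, and CIDR below are illustrative placeholders, and the example assumes tiered policy is available in your Calico edition:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;calicoctl apply -f - &lt;&lt;EOF
# Illustrative only: a high-precedence "security" tier owned by the
# security team, with a guardrail that denies egress to a known-bad
# range before any lower-tier (platform or developer) policy runs.
apiVersion: projectcalico.org/v3
kind: Tier
metadata:
  name: security
spec:
  order: 100                       # lower order = evaluated earlier
---
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: security.block-known-bad   # policies in a tier carry its name as a prefix
spec:
  tier: security
  order: 10
  selector: all()
  types:
    - Egress
  egress:
    - action: Deny
      destination:
        nets:
          - 203.0.113.0/24         # example "malicious destination" range
    - action: Pass                 # hand everything else to the next tier
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;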

&lt;h3&gt;
  
  
  Do you have a zero-trust security policy with pod to pod encryption and workload identity?
&lt;/h3&gt;

&lt;p&gt;In addition to isolation, security means a &lt;a href="https://www.tigera.io/learn/guides/zero-trust/" rel="noopener noreferrer"&gt;zero trust&lt;/a&gt; posture, and that in turn means mTLS for internal cluster traffic. mTLS has become a hard requirement, both for regulators and for security teams that have learned the hard way what unencrypted east-west traffic costs when something goes wrong.&lt;/p&gt;

&lt;p&gt;For organizations that have given up on service mesh, Istio ambient mode is worth a look. It delivers automatic mTLS and SPIFFE-based workload identity across all traffic without the resource cost of sidecars. L7 capabilities such as traffic shaping and advanced observability can be layered in selectively only for the services that need them.&lt;/p&gt;
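
&lt;p&gt;If you want to try it, the ambient data plane is installed once and then enabled per namespace. A minimal sketch, assuming &lt;code&gt;istioctl&lt;/code&gt; is installed and using a placeholder namespace name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;# Install Istio with the sidecar-less ambient profile, then opt a
# namespace into the ambient data plane to get automatic mTLS between
# its workloads. "payments" is a hypothetical namespace.
istioctl install --set profile=ambient -y
kubectl label namespace payments istio.io/dataplane-mode=ambient
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;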

&lt;p&gt;Security is the foundation and non-negotiable starting point on the journey towards a mature Kubernetes posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does your ingress solution have all the capabilities you need without relying on vendor-specific annotations?
&lt;/h3&gt;

&lt;p&gt;The retirement of Ingress NGINX Controller was a wake-up call for many organizations, making them realize that ‘good enough’ is, in fact, not good enough. Migrating to a robust and future-proof implementation of &lt;a href="https://www.tigera.io/learn/guides/kubernetes-security/kubernetes-gateway-api/" rel="noopener noreferrer"&gt;Gateway API&lt;/a&gt; is one more step along the road to operational maturity.&lt;/p&gt;

&lt;p&gt;Ingress and traffic management are evolving rapidly. The Kubernetes Ingress API served its purpose for years, but reliance on annotations, limited protocol support, and a single-controller model have become constraints at scale. The Gateway API replaces it with a role-oriented model. This is more than a technical upgrade. It is a shift not only towards more granular and comprehensive traffic control but towards decentralized management where cluster administrators control the infrastructure and development teams define their application-specific routing rules.&lt;/p&gt;
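
&lt;p&gt;A minimal sketch of that role split in Gateway API terms: the platform team owns the &lt;code&gt;Gateway&lt;/code&gt;, and an application team owns the &lt;code&gt;HTTPRoute&lt;/code&gt; that attaches to it. Class, namespace, hostname, and service names below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;kubectl apply -f - &lt;&lt;EOF
# Illustrative only. The platform team defines the shared entry point...
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: infra
spec:
  gatewayClassName: example-gateway-class   # depends on your implementation
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
---
# ...while an application team routes its own service through it.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-route
  namespace: orders
spec:
  parentRefs:
    - name: shared-gateway
      namespace: infra
  hostnames:
    - orders.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: orders
          port: 8080
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;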

&lt;h3&gt;
  
  
  Is egress getting the attention it needs?
&lt;/h3&gt;

&lt;p&gt;Egress traffic management is the often overlooked sibling of ingress control. Without dedicated egress controls, outbound traffic from your cluster uses the node’s IP address, which means different tenants and workloads become indistinguishable to the outside world. This makes audit trails unreliable, complicates compliance, and creates real security exposure.&lt;/p&gt;

&lt;p&gt;An &lt;a href="https://www.tigera.io/learn/guides/kubernetes-networking/egress-gateway/" rel="noopener noreferrer"&gt;egress gateway architecture&lt;/a&gt; assigns each tenant or namespace a dedicated, static IP address for outbound traffic. External services can then allowlist those specific addresses, firewall rules become deterministic, and your security team can trace any outbound connection back to the workload that initiated it.&lt;/p&gt;

&lt;p&gt;If your pods need to access external endpoints egress control deserves a place on your maturity roadmap, not on the back burner.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you connect your clusters?
&lt;/h3&gt;

&lt;p&gt;It is rare to find organizations with just one Kubernetes cluster in production. Spectro Cloud reported that &lt;a href="https://www.spectrocloud.com/state-of-kubernetes-2025#overview-report" rel="noopener noreferrer"&gt;large enterprises operate more than 20 clusters across five or more cloud environments&lt;/a&gt;. If you are running AI workloads that are more than a simple API for the company chatbot, deploying a multi-cluster architecture that isolates GPU-heavy training jobs from inference endpoints is a baseline expectation.&lt;/p&gt;

&lt;p&gt;Unfortunately, the traditional &lt;a href="https://www.tigera.io/learn/guides/kubernetes-networking/kubernetes-multi-cluster/" rel="noopener noreferrer"&gt;multi-cluster architecture&lt;/a&gt;, which relies on external DNS and load balancers, exposes your internal services and presents a real risk. Beyond the security exposure, it introduces operational drag that compounds with every cluster you add. We are talking about frustrating DNS propagation delays, security policies that have to be manually synchronized across environments and, of course, the inevitable configuration drift.&lt;/p&gt;

&lt;p&gt;Cluster mesh architecture, with its unified observability, Kubernetes-native service discovery that does not rely on external DNS and consistent inter-cluster security policies, is what can keep a complex multi-cluster environment from becoming a liability. Multi-cluster done well is a reliable measure of operational maturity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are you relying solely on hardware load balancers?
&lt;/h3&gt;

&lt;p&gt;Hardware load balancers were built for a pre-Kubernetes world. They have no native concept of pods, services, or namespaces, and every configuration change typically requires a ticket, a separate team, and a procurement cycle. As Kubernetes becomes the default platform for production workloads, that operational friction compounds. The more clusters you run and the more latency-sensitive your workloads become, the more the limitations of hardware-centric load balancing show up in your incident logs and your budget.&lt;/p&gt;

&lt;p&gt;A Kubernetes-native load balancer replaces the appliance with software that runs inside the cluster and understands its abstractions. Capacity scales horizontally by adding nodes, not by upgrading hardware. Configuration uses standard Kubernetes resources, which means no separate management console and no version drift between your cluster and your load balancer. For teams managing payment processing, trading systems, or real-time data pipelines, the combination of eBPF-based forwarding, consistent hashing, and graceful node draining delivers the reliability of enterprise appliances without the operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is your team still stitching together clues from kubectl and scattered logs, or do you have a single, unified view across your entire environment?
&lt;/h3&gt;

&lt;p&gt;Kubernetes environments can fail quietly. Services degrade, traffic patterns shift, and workloads compete for resources in ways that are invisible without the right instrumentation in place. In a single cluster, experienced engineers can often piece together what is happening from logs and metrics. Across multiple clusters, namespaces, and workload types that approach becomes highly inefficient and costly. Managing cost and efficiently tracking down problems is even harder, and more imperative, now that AI workloads, with their training jobs, inference endpoints and non-deterministic agents, often share infrastructure and resources with business-critical services.&lt;/p&gt;

&lt;p&gt;Unified observability is essential to keeping all the moving parts manageable. Without Kubernetes-aware telemetry that is enriched with metadata about namespaces, services, and workload identity, teams are operating blind. Mature observability means you can detect anomalous traffic patterns in real time, trace requests across cluster boundaries, and generate the audit evidence that compliance frameworks demand. It turns reactive firefighting into proactive operations. Organizations that strive for operational maturity cannot do without it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F683zan9brf76kap19tlu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F683zan9brf76kap19tlu.png" width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Where are you on the journey to operational maturity?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where do you stand?
&lt;/h2&gt;

&lt;p&gt;No organization achieves Kubernetes operational maturity overnight, and not everything needs to be optimized immediately. What matters is knowing where you stand today so you can prioritize items that will have the greatest impact on your security posture, operational efficiency, and ability to support your current and future workloads. Whether you are still relying on default-allow networking, beginning to explore egress controls, or already running a multi-cluster mesh, there is always a next step on the maturity curve.&lt;/p&gt;

&lt;p&gt;Read our ebook, &lt;a href="https://www.tigera.io/lp/ebook-building-resilient-multi-cluster-kubernetes/" rel="noopener noreferrer"&gt;Building Resilient Multi-Cluster Kubernetes&lt;/a&gt; to get a practical framework for closing the gap between Kubernetes adoption and operational readiness.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/deployed-is-not-the-same-as-ready-how-mature-is-your-kubernetes-environment/" rel="noopener noreferrer"&gt;Deployed Is Not the Same as Ready: How Mature Is Your Kubernetes Environment?&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera – Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>featuredblog</category>
      <category>technicalblog</category>
      <category>bestpractices</category>
      <category>products</category>
    </item>
    <item>
      <title>Beyond the Prompt: AI Agent Design Patterns and the New Governance Gap</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:25:41 +0000</pubDate>
      <link>https://dev.to/tigeraio/beyond-the-prompt-ai-agent-design-patterns-and-the-new-governance-gap-4eki</link>
      <guid>https://dev.to/tigeraio/beyond-the-prompt-ai-agent-design-patterns-and-the-new-governance-gap-4eki</guid>
      <description>&lt;p&gt;If you are treating Large Language Models (LLMs) like simple question-and-answer machines, you are leaving their most transformative potential on the table. The industry has officially shifted from zero-shot prompting to structured &lt;a href="https://youtu.be/GDm_uH6VxPY?si=xsD64NCIrkhEU71d" rel="noopener noreferrer"&gt;AI agent design patterns&lt;/a&gt; and agentic workflows where AI iteratively reasons, uses external tools, and collaborates to solve complex engineering problems. These design patterns are the architectural blueprints that determine how autonomous Agentic AI systems work and interact with your infrastructure.&lt;/p&gt;

&lt;p&gt;But as these systems proliferate faster than organizations can govern them, they introduce a critical &lt;a href="https://www.tigera.io/blog/securing-ai-workloads-in-kubernetes-why-traditional-network-security-isnt-enough/" rel="noopener noreferrer"&gt;AI agent security&lt;/a&gt; risk: By the end of 2026, &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025" rel="noopener noreferrer"&gt;40% of enterprise applications will feature embedded AI agents&lt;/a&gt;, and those teams will urgently need purpose-built strategies to govern this new autonomous workforce before it becomes the next major shadow IT crisis.&lt;/p&gt;

&lt;p&gt;Before you can secure these autonomous systems, you have to understand how they are built. Here is a technical breakdown of the current AI Agent design patterns you need to know, and the specific security blind spots each design pattern creates.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Foundational Execution Patterns
&lt;/h2&gt;

&lt;p&gt;Building reliable AI systems comes down to how you route the cognitive load. Here are the three baseline structural patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  A. The Single Agent (Tool Use)
&lt;/h3&gt;

&lt;p&gt;In this pattern, a single LLM is equipped with access to external, deterministic tools (APIs, databases, bash environments, or the Model Context Protocol).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How it works:&lt;/strong&gt; The agent receives a prompt, realizes it lacks the necessary context, calls a tool, ingests the output, and formulates a final response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Governance Challenge:&lt;/strong&gt; When an agent is granted API keys to query your cluster, it operates with implicit trust to access that data. If compromised via prompt injection, that single agent becomes an unmonitored vector for data exfiltration.&lt;/li&gt;
&lt;/ul&gt;
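
&lt;p&gt;To make the loop concrete, here is a minimal Python sketch of a single-agent tool-use harness. It is illustrative only: &lt;code&gt;call_llm&lt;/code&gt;, the &lt;code&gt;TOOLS&lt;/code&gt; registry, and the response shape are assumptions rather than any particular framework’s API. The key point is that the harness, not the model, decides whether a requested tool may run.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative single-agent tool-use loop. call_llm and TOOLS are
# placeholder assumptions, not a specific framework's API.

def get_weather(location):
    # A deterministic external tool the agent is allowed to call.
    return {"location": location, "forecast": "cloudy"}

TOOLS = {"get_weather": get_weather}

def run_single_agent(prompt, call_llm):
    # 1. Ask the model; it may answer directly or request a tool call.
    reply = call_llm(prompt)
    if reply.get("tool_call") is None:
        return reply["content"]

    # 2. The harness, not the model, decides whether the tool is allowed.
    name = reply["tool_call"]["name"]
    args = reply["tool_call"]["arguments"]
    if name not in TOOLS:
        raise PermissionError(f"tool {name!r} is not authorized for this agent")

    # 3. Execute the tool and feed its output back for the final answer.
    result = TOOLS[name](**args)
    return call_llm(f"{prompt}\nTool result: {result}")["content"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;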

&lt;h3&gt;
  
  
  B. The Sequential Agent (The Assembly Line)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneqjmgr81wiah50e528u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneqjmgr81wiah50e528u.png" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a single agent fails at a complex task, we break the task down into a pipeline. Sequential agents operate in a linear hand-off, where the output of &lt;em&gt;Agent A&lt;/em&gt; becomes the input of &lt;em&gt;Agent B&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How it works:&lt;/strong&gt; You deploy specialized micro-agents. Agent 1 extracts data, Agent 2 analyzes it, and Agent 3 formats the final report.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Governance Challenge:&lt;/strong&gt; As data flows between agents, maintaining an audit lineage becomes incredibly complex. You cannot easily trace which tools Agent 2 called based on Agent 1’s corrupted input.&lt;/li&gt;
&lt;/ul&gt;
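
&lt;p&gt;A minimal sketch of the hand-off is below, with each stage stubbed as a plain function so the structure stays visible; the &lt;code&gt;lineage&lt;/code&gt; list hints at the audit-trail problem described above. Stage names and data shapes are illustrative assumptions, not a prescribed design.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sequential (assembly-line) pipeline. In practice each stage
# would wrap its own LLM call; plain functions keep the hand-off visible.

def extract(raw_text):
    return {"records": raw_text.splitlines(), "source": "upload"}

def analyze(extracted):
    return {"count": len(extracted["records"]), "source": extracted["source"]}

def report(analysis):
    return f"Analyzed {analysis['count']} records from {analysis['source']}"

def run_pipeline(raw_text):
    output = raw_text
    lineage = []  # audit trail of every hand-off between agents
    for stage in (extract, analyze, report):
        output = stage(output)
        lineage.append({"stage": stage.__name__, "output": output})
    return output, lineage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;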

&lt;h3&gt;
  
  
  C. The Parallel Agent (Concurrency &amp;amp; Voting)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6ocfnmpe6u3faqpmx3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6ocfnmpe6u3faqpmx3k.png" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To combat the latency of sequential pipelines, the Parallel pattern fans out tasks to multiple specialized agents simultaneously.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How it works:&lt;/strong&gt; A router agent delegates sub-tasks to multiple worker agents concurrently. Once they finish, a “Judge” or “Synthesizer” agent aggregates the parallel outputs into a cohesive result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Governance Challenge:&lt;/strong&gt; You now have multiple autonomous agents acting concurrently. Traditional security tools built for deterministic services cannot provide the visibility or control required for these non-deterministic autonomous actions.&lt;/li&gt;
&lt;/ul&gt;
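
&lt;p&gt;The fan-out and judge steps can be sketched with a thread pool, as below. The worker and judge functions are placeholders standing in for real LLM-backed agents; only the concurrency structure is the point here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative fan-out / judge pattern using a thread pool. The workers and
# the judge are placeholders for real LLM-backed agents.
from concurrent.futures import ThreadPoolExecutor

def research_worker(task):
    return f"research notes for {task}"

def code_worker(task):
    return f"draft implementation for {task}"

def judge(outputs):
    # A real judge agent would score or merge candidates with another LLM call.
    return max(outputs, key=len)

def run_parallel(task):
    workers = [research_worker, code_worker]
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(worker, task) for worker in workers]
        outputs = [future.result() for future in futures]
    return judge(outputs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;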

&lt;h2&gt;
  
  
  2. The Advanced Cognitive Patterns That Complicate AI Agent Security
&lt;/h2&gt;

&lt;p&gt;To make agents truly autonomous, developers are giving them the ability to “think” about their own work. These cognitive patterns drastically improve output quality, but introduce severe behavioral unpredictability.&lt;/p&gt;

&lt;h3&gt;
  
  
  A. The Reflection Pattern (Critic &amp;amp; Refiner)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27phjiq6j9pfqszg08tl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27phjiq6j9pfqszg08tl.png" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Reflection pattern pairs a Generator agent with a Critic agent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How it works:&lt;/strong&gt; The Generator outputs a first draft. The Critic evaluates it against guardrails, and the Generator iteratively refines the output until it passes the Critic’s checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Wrapping an older model (like GPT-3.5) in a Reflection loop often produces higher-quality, more reliable code than a zero-shot prompt to a cutting-edge model (like GPT-5.4 Pro).&lt;/li&gt;
&lt;/ul&gt;
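
&lt;p&gt;A hedged sketch of the loop follows: &lt;code&gt;generate()&lt;/code&gt; and &lt;code&gt;critique()&lt;/code&gt; stand in for real LLM calls and guardrail checks, and the stopping rule (accept when the critic returns no feedback, give up after a few rounds) is one common choice rather than a prescribed one.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative generator/critic loop. generate() and critique() stand in
# for real LLM calls and guardrail checks.

def generate(task, feedback=None):
    draft = f"def solve():  # solution for {task}"
    if feedback:
        draft = f"{draft}  # revised: {feedback}"
    return draft

def critique(draft):
    # Return None when the draft passes the guardrails, else actionable feedback.
    if "revised" not in draft:
        return "add input validation"
    return None

def reflect(task, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        feedback = critique(draft)
        if feedback is None:
            return draft
    return draft  # give up after max_rounds and return the last attempt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;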

&lt;h3&gt;
  
  
  B. The Planning Pattern
&lt;/h3&gt;

&lt;p&gt;For highly ambiguous goals, agents need the autonomy to devise their own roadmaps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How it works:&lt;/strong&gt; Given a high-level goal, the Planning agent decomposes it into a Directed Acyclic Graph (DAG) of sub-tasks. It executes the plan step-by-step, adapting dynamically if a step fails (e.g., “Dependency missing, re-routing to fetch from alternate repo”).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Governance Challenge:&lt;/strong&gt; AI agents don’t follow scripts. They autonomously choose which tools to call, which data to access, and which agents to collaborate with, making static security models completely obsolete.&lt;/li&gt;
&lt;/ul&gt;
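
&lt;p&gt;Here is a simplified plan-then-execute sketch over a task DAG. In a real system the plan itself would come from the LLM; the fallback branch mirrors the dependency re-routing behavior described above. Task names and the failure mode are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative plan-then-execute sketch over a DAG of sub-tasks. A real
# planner would derive the DAG from the goal with an LLM call.

def plan(goal):
    # task name: list of tasks it depends on
    return {
        "fetch_dependency": [],
        "build": ["fetch_dependency"],
        "test": ["build"],
    }

def execute(task):
    if task == "fetch_dependency":
        raise RuntimeError("primary repo unreachable")
    return f"{task} ok"

def execute_fallback(task):
    # Adapt dynamically instead of aborting the whole plan.
    return f"{task} ok (alternate repo)"

def run_plan(goal):
    dag = plan(goal)
    done = {}
    while len(done) != len(dag):
        for task, deps in dag.items():
            if task in done or any(dep not in done for dep in deps):
                continue
            try:
                done[task] = execute(task)
            except RuntimeError:
                done[task] = execute_fallback(task)
    return done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;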

&lt;h2&gt;
  
  
  3. The Cold Start Problem: Why AI Agent Governance Can’t Wait
&lt;/h2&gt;

&lt;p&gt;The ultimate evolution of these patterns is &lt;strong&gt;Multi-Agent Collaboration&lt;/strong&gt;, a “society of minds” system where diverse agents with distinct personas (The Architect, The Security Engineer, The QA Tester) debate, share data, and execute code collaboratively across boundaries. &lt;strong&gt;AI agent security&lt;/strong&gt; — &lt;em&gt;the discipline of discovering, controlling, and auditing what autonomous agents can access and do&lt;/em&gt; — requires a fundamentally different approach than traditional application security. Each pattern described above introduces distinct risks, and in combination, they create attack surfaces that traditional security models were never designed to handle.&lt;/p&gt;

&lt;p&gt;But as AI/ML engineering teams race to deploy and scale these &lt;a href="https://www.tigera.io/blog/how-ai-agents-communicate-understanding-the-a2a-protocol-for-kubernetes/" rel="noopener noreferrer"&gt;Agent-to-Agent (A2A) architectures&lt;/a&gt;, most enterprises realize they don’t have any inventory of the AI agents running in their environment, including shadow agents deployed by teams outside official channels. A massive infrastructure challenge arises: &lt;strong&gt;How do these agents communicate securely?&lt;/strong&gt; You cannot govern what you cannot see.&lt;/p&gt;

&lt;p&gt;Whether your AI agents run in Kubernetes, cloud environments, on-premises, at the edge, or on developer laptops, governance that only covers one environment is governance with holes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Tigera Agent Governance (TAG)
&lt;/h3&gt;

&lt;p&gt;We are moving past the era of human-in-the-loop chat interfaces into human-on-the-loop autonomous systems. To bridge this gap, Tigera is introducing &lt;a href="https://www.tigera.io/tigera-products/tigera-agent-governance/" rel="noopener noreferrer"&gt;TAG&lt;/a&gt;: the platform with the discipline to discover, authenticate, authorize, enforce, and audit every agent action, wherever agents run.&lt;/p&gt;

&lt;p&gt;TAG is the first platform to own the full five-pillar framework required for modern AI workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discovery:&lt;/strong&gt; Central registry and auto-discovery of shadow agents across your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication:&lt;/strong&gt; Cryptographic trust giving every agent a verified identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization:&lt;/strong&gt; Default-deny, fine-grained access control with tool-level binding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcement:&lt;/strong&gt; Real-time enforcement that enables development velocity without bureaucratic blockers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Full audit lineage, service graph visualization, and board-ready compliance reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Your AI agents are making decisions. Do you know what they’re authorized to do?&lt;/strong&gt; Do not wait for an autonomous agent to go rogue. Secure your next-generation architecture with universal governance built for the Agentic AI era.&lt;br&gt;&lt;br&gt;
→ &lt;a href="https://www.tigera.io/contact-tigera-agent-governance/" rel="noopener noreferrer"&gt;Request Early Access to TAG&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/beyond-the-prompt-ai-agent-design-patterns-and-the-new-governance-gap/" rel="noopener noreferrer"&gt;Beyond the Prompt: AI Agent Design Patterns and the New Governance Gap&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera – Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalblog</category>
      <category>aiagentsecurity</category>
      <category>products</category>
    </item>
    <item>
      <title>How to Stub LLMs for AI Agent Security Testing and Governance</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:15:28 +0000</pubDate>
      <link>https://dev.to/tigeraio/how-to-stub-llms-for-ai-agent-security-testing-and-governance-34n2</link>
      <guid>https://dev.to/tigeraio/how-to-stub-llms-for-ai-agent-security-testing-and-governance-34n2</guid>
      <description>&lt;p&gt;_ &lt;strong&gt;Note:&lt;/strong&gt; The core architecture for this pattern was introduced by &lt;a href="https://www.linkedin.com/in/isaac-hawley-9481743/" rel="noopener noreferrer"&gt;Isaac Hawley&lt;/a&gt; from Tigera._&lt;/p&gt;

&lt;p&gt;If you are building an AI agent that relies on tool calling, complex routing, or the &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;, you’re not just building a chatbot anymore. You are building an autonomous system with access to your internal APIs.&lt;/p&gt;

&lt;p&gt;With that power comes a massive security and governance headache, and AI agent security testing is where most teams hit a wall. &lt;strong&gt;How do you definitively prove that your agent’s identity and access management (IAM) actually works?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The scale of the problem is hard to overstate. Microsoft’s telemetry shows that &lt;a href="https://www.microsoft.com/en-us/security/blog/2026/02/10/80-of-fortune-500-use-active-ai-agents-observability-governance-and-security-shape-the-new-frontier/" rel="noopener noreferrer"&gt;80% of Fortune 500 companies now run active AI agents&lt;/a&gt;, yet only 47% have implemented specific AI security controls. Most teams are deploying agents faster than they can test them.&lt;/p&gt;

&lt;p&gt;If an agent is hijacked via prompt injection, or simply hallucinates a destructive action, does your governance layer stop it? Testing this usually forces engineers into a frustrating trade-off:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use the real API (Gemini, OpenAI):&lt;/strong&gt; Real models are heavily RLHF’d to be safe and polite. It is incredibly difficult (and non-deterministic) to intentionally force a real model to “go rogue” and consistently output malicious tool calls so you can test your security boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mock the internal tools only:&lt;/strong&gt; You test your Python or Go functions in isolation, but you never actually test the “Agent Loop”—meaning you aren’t testing if the harness correctly applies the user’s OAuth tokens or &lt;a href="https://docs.tigera.io/calico/latest/network-policy/get-started/kubernetes-policy/kubernetes-network-policy" rel="noopener noreferrer"&gt;Role-Based Access Control (RBAC)&lt;/a&gt; to the LLM’s requested tool call.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Recently, Isaac Hawley introduced a much better pattern: &lt;strong&gt;the Stub Model&lt;/strong&gt;, a way to stub your LLM for testing that makes your security assertions completely deterministic.&lt;/p&gt;

&lt;p&gt;A Stub Model (or mock LLM) is a deterministic, non-AI replacement for a real language model that you inject into your agent harness during testing. It returns hardcoded tool-call requests — including deliberately malicious ones — so you can prove that your security layer correctly intercepts and blocks unauthorized actions without relying on a live model API.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Concept: A “Malicious” Router for AI Agent Security Testing
&lt;/h2&gt;

&lt;p&gt;Instead of hitting a real model API during tests, we inject a &lt;code&gt;StubLLM&lt;/code&gt; that implements our system’s core LLM interface.&lt;/p&gt;

&lt;p&gt;The stub doesn’t use any AI. Instead, it parses incoming prompts for specific testing triggers and returns hardcoded, completely predictable tool calls. Crucially, this forces your agent harness to &lt;strong&gt;actually execute the real underlying tool pipeline&lt;/strong&gt;. You aren’t just faking a final text response; you are making the LLM trigger your application’s real execution loop.&lt;/p&gt;

&lt;p&gt;From a governance perspective, this is a superpower. You can program the stub to request highly privileged actions (like &lt;code&gt;drop_database&lt;/code&gt; or &lt;code&gt;read_all_users&lt;/code&gt;), and then write strict, lightning-fast assertions to prove that your Agent Harness intercepts the call, checks the executing user’s identity, and blocks the action.&lt;/p&gt;

&lt;p&gt;Here is how you can implement and test this security pattern in both Python and Go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python: Proving RBAC &amp;amp; Tool Governance
&lt;/h3&gt;

&lt;p&gt;In Python, we use a &lt;code&gt;Protocol&lt;/code&gt; to define our LLM dependency, and then build a Stub that intentionally requests unauthorized actions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Protocol&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="c1"&gt;# Define standard tool call response formats
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
   &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
   &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
   &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="c1"&gt;# Define the LLM Interface
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;# Implement the Stub Model for Security Testing
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StubLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="c1"&gt;# 1. Standard authorized action
&lt;/span&gt;       &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MOCK_WEATHER_TOOL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})]&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="c1"&gt;# 2. Malicious / Unauthorized action for Governance testing
&lt;/span&gt;       &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MOCK_UNAUTHORIZED_DELETE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                   &lt;span class="nc"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                       &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_malicious_999&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete_user_account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin_01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# The LLM is trying something dangerous!
&lt;/span&gt;                   &lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is a stubbed standard response.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Security Unit Test (&lt;code&gt;pytest&lt;/code&gt;):&lt;/strong&gt; With the stub in place, we can test that our Agent correctly parses the dangerous tool call, evaluates the user’s identity, and &lt;strong&gt;blocks&lt;/strong&gt; the execution of the real local Python function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_agent_rbac_blocks_unauthorized_tool_execution&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="c1"&gt;# Arrange: Inject our deterministic stub into the Agent
&lt;/span&gt;&lt;span class="n"&gt;stubbed_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StubLLM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Initialize our agent harness with a heavily restricted "guest" identity
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stubbed_llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guest_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Act: Send the trigger that forces our stub to attempt a destructive tool call
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please MOCK_UNAUTHORIZED_DELETE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Assert: Verify the Agent's governance harness intercepted the call,
# checked the "guest_user" identity, and blocked the REAL local tool.
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocked_by_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_executed&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Insufficient permissions to execute delete_user_account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Go: Validating OAuth &amp;amp; Identity Boundaries
&lt;/h3&gt;

&lt;p&gt;In Go, this pattern shines for validating complex OAuth scopes or identity propagation in multi-agent networks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="s"&gt;"encoding/json"&lt;/span&gt;
   &lt;span class="s"&gt;"strings"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ToolCall&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;ID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"id"`&lt;/span&gt;
   &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"name"`&lt;/span&gt;
   &lt;span class="n"&gt;Arguments&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="s"&gt;`json:"arguments"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"content,omitempty"`&lt;/span&gt;
   &lt;span class="n"&gt;ToolCalls&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ToolCall&lt;/span&gt; &lt;span class="s"&gt;`json:"tool_calls,omitempty"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;Generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;StubLLM&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewStubLLM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;StubLLM&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;StubLLM&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;StubLLM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="c"&gt;// Simulate an Agent trying to access a secure internal system via MCP&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"MOCK_ACCESS_SECURE_VAULT"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"secret_id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"prod_db_password"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="n"&gt;ToolCalls&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ToolCall&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
               &lt;span class="p"&gt;{&lt;/span&gt;
                   &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"call_vault_123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"read_secure_vault"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;Arguments&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="p"&gt;},&lt;/span&gt;
           &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Standard response"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Security Unit Test (&lt;code&gt;testing&lt;/code&gt;):&lt;/strong&gt; We write a test to guarantee that if the LLM decides to hit the vault, the Agent harness forces the underlying tool to respect the provided OAuth context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;agent_test&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="s"&gt;"testing"&lt;/span&gt;
&lt;span class="s"&gt;"errors"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestAgentEnforcesOAuthScopes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="c"&gt;// Arrange: Initialize the agent with the Stub model&lt;/span&gt;
&lt;span class="n"&gt;stub&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewStubLLM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c"&gt;// Create an agent context with a standard user OAuth token (No Vault Access)&lt;/span&gt;
&lt;span class="n"&gt;mockOAuthContext&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"read:public"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;myAgent&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mockOAuthContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;// Act: Trigger the LLM to request a highly privileged tool call&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;myAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"I need you to MOCK_ACCESS_SECURE_VAULT"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;// Assert: Verify the harness evaluated the tool against the OAuth scope and blocked it&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CRITICAL SECURITY FAILURE: Agent executed secure vault tool without proper OAuth scope"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ErrUnauthorizedToolExecution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Expected authorization error, got: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecutedTool&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"read_secure_vault"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"The real tool was executed despite lack of permissions!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Security &amp;amp; Governance Teams Love This Architecture
&lt;/h2&gt;

&lt;p&gt;By treating the LLM like any other untrusted external dependency, we achieve total control over our agent’s testing environment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auditable Proof of Governance:&lt;/strong&gt; You now have concrete CI/CD tests proving that your agent respects OAuth scopes, RBAC, and identity guardrails. You aren’t just hoping the model behaves; you are proving the harness defends against it when it doesn’t.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests the Real Agent Harness:&lt;/strong&gt; Because the LLM returns a perfectly formatted tool call request, your application code actually executes its real security middleware. You validate the entire execution loop, not just a mocked final answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightning Fast &amp;amp; Free:&lt;/strong&gt; You can run thousands of these security edge-case tests in milliseconds without spending a dime on API tokens or exposing secrets in your CI pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Force Prompt Injection Scenarios:&lt;/strong&gt; You can easily stub the LLM to return tool arguments containing SQL injection or XSS payloads to ensure your local tools sanitize inputs provided by the AI (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
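
&lt;p&gt;As a sketch of that last point, the stub below returns a tool call whose argument carries a classic SQL injection payload. It reuses the &lt;code&gt;ToolCall&lt;/code&gt;, &lt;code&gt;Response&lt;/code&gt;, and &lt;code&gt;Agent&lt;/code&gt; names from the Python example above, and the &lt;code&gt;blocked_by_validation&lt;/code&gt; status it asserts is an assumption about your harness rather than a prescribed API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative extension of the StubLLM idea: deliver a SQL injection payload
# through a tool argument and assert it never reaches the real tool.
# The response fields asserted here mirror the assumptions of the earlier test.

class InjectionStubLLM:
    def generate(self, prompt):
        if "MOCK_SQL_INJECTION" in prompt:
            return Response(tool_calls=[ToolCall(
                id="call_inject_1",
                name="lookup_order",
                arguments={"order_id": "1; DROP TABLE orders;--"},
            )])
        return Response(content="This is a stubbed standard response.")

def test_tool_layer_rejects_injected_arguments():
    agent = Agent(llm_client=InjectionStubLLM(), user_role="guest_user")
    response = agent.run("Please MOCK_SQL_INJECTION")
    # The real lookup_order tool should validate its input, never execute it.
    assert response.status == "blocked_by_validation"
    assert response.tool_executed is None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;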

&lt;h2&gt;
  
  
  The Trade-Offs: What the Stub Model DOESN’T Test
&lt;/h2&gt;

&lt;p&gt;As powerful as this architecture is for testing your infrastructure, it’s important to acknowledge that it is not a silver bullet. There are two major things the Stub Model cannot test:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It tests the pipes, not the brain:&lt;/strong&gt; The stub proves your system can correctly block a malicious tool call, but it does &lt;em&gt;not&lt;/em&gt; test whether your system prompt is resilient to &lt;a href="https://www.tigera.io/learn/guides/llm-security/prompt-injection/" rel="noopener noreferrer"&gt;prompt injection&lt;/a&gt; in the first place. You still need LLM-as-a-judge pipelines and continuous evaluation frameworks to test your model’s actual reasoning capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Schema Drift:&lt;/strong&gt; If OpenAI, Anthropic, or Google update the shape of their underlying JSON tool-call schema, your hardcoded stub tests will still pass with flying colors while your production environment crashes. You still need a handful of real, end-to-end (E2E) smoke tests running against the live API on a nightly basis to catch vendor drift (a minimal example follows this list).&lt;/li&gt;
&lt;/ol&gt;
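
&lt;p&gt;For that second trade-off, a minimal nightly smoke test might look like the sketch below. &lt;code&gt;RealLLMClient&lt;/code&gt; is an assumption: whatever production adapter implements your &lt;code&gt;LLMClient&lt;/code&gt; interface. The goal is simply to assert the shape of real tool calls so schema drift fails loudly before your stub-backed suite lulls you into false confidence.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative nightly smoke test against the live provider API.
# RealLLMClient is assumed to be your production adapter for the LLMClient
# interface; only the shape of the returned tool calls is being checked.
import os

import pytest

@pytest.mark.skipif(
    os.environ.get("RUN_E2E_SMOKE") != "1",
    reason="live-API smoke tests run only in the nightly job",
)
def test_live_provider_still_emits_expected_tool_call_shape():
    client = RealLLMClient()  # production adapter (assumed)
    response = client.generate("What is the weather in London? Use a tool.")
    assert response.tool_calls, "provider no longer returns tool calls"
    call = response.tool_calls[0]
    # These are the fields the deterministic stub and the harness rely on.
    assert call.id and call.name
    assert isinstance(call.arguments, dict)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;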

&lt;h2&gt;
  
  
  Beyond the Chatbot: Engineering for Agency
&lt;/h2&gt;

&lt;p&gt;If you are building complex systems, delegating between autonomous agents, or integrating internal APIs via MCP, you cannot afford to have untested authorization loops.&lt;/p&gt;

&lt;p&gt;Treating the LLM as just another untrusted external dependency gives you &lt;strong&gt;auditable proof of governance&lt;/strong&gt;: thousands of CI/CD security tests that run in milliseconds, without exposing secrets or spending a dime on API tokens.&lt;/p&gt;

&lt;p&gt;Do yourself a favor: &lt;strong&gt;Stub your LLMs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stubbing your LLM proves the guardrails work in test. &lt;strong&gt;TAG&lt;/strong&gt; enforces them in production, giving you continuous visibility into every agent action, authorization decision, and policy enforcement event across your entire organization. &lt;a href="https://www.tigera.io/contact-tigera-agent-governance/" rel="noopener noreferrer"&gt;Talk to us about TAG&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/how-to-stub-llms-for-ai-agent-security-testing-and-governance/" rel="noopener noreferrer"&gt;How to Stub LLMs for AI Agent Security Testing and Governance&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera - Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalblog</category>
      <category>aiagentsecuritygover</category>
      <category>bestpractices</category>
      <category>howto</category>
    </item>
    <item>
      <title>Introducing AI Assistant for Calico, Calico Load Balancer, and Seamless VM-to-Kubernetes Migration</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Mon, 23 Mar 2026 07:01:36 +0000</pubDate>
      <link>https://dev.to/tigeraio/introducing-ai-assistant-for-calico-calico-load-balancer-and-seamless-vm-to-kubernetes-migration-4h80</link>
      <guid>https://dev.to/tigeraio/introducing-ai-assistant-for-calico-calico-load-balancer-and-seamless-vm-to-kubernetes-migration-4h80</guid>
      <description>&lt;p&gt;&lt;strong&gt;SAN JOSE, Calif., March 23, 2026&lt;/strong&gt; — &lt;a href="https://www.tigera.io/?utm_source=syndicate&amp;amp;utm_medium=press_release&amp;amp;utm_campaign=KubeCon2026" rel="noopener noreferrer"&gt;Tigera&lt;/a&gt;, the creator and maintainer of Project Calico, today announced a major expansion of its Unified Network Security Platform for Kubernetes, aimed at helping enterprises consolidate infrastructure and accelerate the migration of legacy workloads to cloud-native platforms.&lt;/p&gt;

&lt;p&gt;The new capabilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Assistant for Calico:&lt;/strong&gt; A conversational intelligence layer that replaces complex manual log analysis with natural-language troubleshooting and proactive security audits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calico Load Balancer:&lt;/strong&gt; A high-performance, eBPF-based, software-defined load balancer that replaces expensive, rigid hardware appliances with a Kubernetes-native solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless VM-to-Kubernetes Migration:&lt;/strong&gt; Advanced Layer 2 (L2) networking support eliminates migration friction by allowing virtual machines to move into Kubernetes clusters without changing their original IP addresses or existing VLAN dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These innovations help organizations tackle the rising “complexity tax” in managing high-scale Kubernetes clusters and provide a high-velocity path to consolidate virtual machines and containers into a single, standardized platform.&lt;/p&gt;

&lt;p&gt;“The industry is at a breaking point where the operational overhead of managing legacy hardware and fragmented VM silos is no longer sustainable. By building a distributed load balancer into the fabric of Calico, launching an AI assistant that ‘troubleshoots at the speed of thought,’ and introducing live migration support to move VMs to Kubernetes, we are giving platform teams the power to innovate rather than spend hours managing and troubleshooting.”&lt;/p&gt;

&lt;p&gt;— Ratan Tipirneni, president and CEO, Tigera&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting at the Speed of Thought: Introducing an AI Assistant for Calico
&lt;/h2&gt;

&lt;p&gt;Despite the wealth of telemetry available in modern clusters, SREs often struggle to find the “connecting thread” across isolated events. Calico’s AI Assistant provides a context-aware intelligence layer to extract actionable insights from raw telemetry.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ask, Don’t Query:&lt;/strong&gt; Engineers can move away from rigid query languages and toward articulating intent in plain English. For example: “What are the unrestricted egress destinations currently receiving traffic from my pods?”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-Aware Explanations:&lt;/strong&gt; The assistant provides summaries and recommendations generated from real telemetry and policy context, explaining exactly why traffic is being denied and offering remediation advice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proactive Security:&lt;/strong&gt; Beyond troubleshooting, the AI assistant maintains cluster stability by detecting unused network policies, identifying misconfigurations, and surfacing exposure risks before they cause an outage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explore the full capabilities: &lt;a href="https://www.tigera.io/blog/ai-assistant-for-calico-troubleshooting-at-the-speed-of-thought/" rel="noopener noreferrer"&gt;How the AI Assistant for Calico simplifies troubleshooting at the speed of thought.&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Eliminating Hardware Bottlenecks: The Calico Load Balancer
&lt;/h2&gt;

&lt;p&gt;On-premises Kubernetes teams have traditionally relied on legacy hardware appliances to expose services, creating significant operational overhead and rigid dependencies between networking and platform teams. These external solutions often lack visibility into Kubernetes service context, do not scale horizontally, and require manual coordination for even basic software upgrades.&lt;/p&gt;

&lt;p&gt;Tigera is disrupting this model with the Calico Load Balancer, a modern, software-defined solution built natively into the Calico platform. By transforming existing cluster nodes into a distributed, session-stable load-balancing tier, platform teams gain full control over service advertisement and configuration using the same Kubernetes workflows they already use.&lt;/p&gt;

&lt;p&gt;This Kubernetes-native innovation delivers several critical advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session Persistence for Stateful Apps:&lt;/strong&gt; A high-performance, eBPF-based data plane ensures that latency-sensitive, stateful applications like Kafka or RabbitMQ maintain active connections even during node failures or changes in network paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful Node Restarts:&lt;/strong&gt; Platform teams can mark nodes for maintenance and take them offline without impacting user sessions, preventing lost transactions for critical business services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Latency:&lt;/strong&gt; By enabling return traffic to take a shorter path back to the client, the solution reduces latency compared to traditional appliances where traffic must pass through the same central hardware twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Scaling:&lt;/strong&gt; The load balancer scales horizontally with the cluster; adding more nodes automatically adds more load-balancing capacity without vertical scaling limits or vendor upgrade cycles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Service and Declarative Control:&lt;/strong&gt; Configuration is handled through standard Kubernetes resources and GitOps workflows, removing cross-team bottlenecks and eliminating the need for tickets or separate management consoles.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Technical Deep Dive: &lt;a href="https://www.tigera.io/blog/calico-load-balancer-simplifying-network-traffic-management-with-ebpf/" rel="noopener noreferrer"&gt;Simplifying network traffic management with eBPF and the Calico Load Balancer.&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Great Migration: Seamlessly Moving VMs to Kubernetes
&lt;/h2&gt;

&lt;p&gt;Historically, migrating virtual machines to Kubernetes meant a forced network redesign because VMs rely on static IP addresses and legacy Layer 2 VLAN configurations. Tigera’s new L2 networking support removes this friction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Change Migration:&lt;/strong&gt; VMs can be migrated from VMware to Kubernetes (KubeVirt) while keeping their original IP addresses, ensuring business continuity for applications with hardcoded dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant Security Upgrade:&lt;/strong&gt; Once migrated, VMs are automatically protected by Calico’s microsegmentation, allowing organizations to retire costly third-party security tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once migrated, the VMs in Kubernetes benefit from Calico’s advanced network security and observability capabilities. For users familiar with technologies like VMware NSX, Calico provides NSX-like functionality, including software-defined networking, microsegmentation, a workload-based firewall, and egress gateways for VMs running in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Step-by-Step Guide: &lt;a href="https://www.tigera.io/blog/lift-and-shift-vms-to-kubernetes-with-calico-l2-bridge-networks/" rel="noopener noreferrer"&gt;Lift and shift VMs to Kubernetes with Calico L2 bridge networks.&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  One Platform for Networking, Security, and Observability
&lt;/h2&gt;

&lt;p&gt;The new Calico Unified Network Security Platform provides platform teams with a single, operator-managed solution. This allows teams to gain consistent network policy enforcement across L3-L7 layers with unified visibility, eliminating the overhead of managing multiple tools. Calico works consistently across any Kubernetes distribution, virtual machines, and bare-metal servers, ensuring enterprises can avoid vendor lock-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About Tigera&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/?utm_source=syndicate&amp;amp;utm_medium=press_release&amp;amp;utm_campaign=KubeCon2026" rel="noopener noreferrer"&gt;Tigera&lt;/a&gt; provides Calico, a unified network security and observability platform to prevent, detect, and mitigate security breaches in Kubernetes clusters. Tigera’s open-source offering, &lt;a href="https://www.tigera.io/tigera-products/calico?utm_source=syndicate&amp;amp;utm_medium=press_release&amp;amp;utm_campaign=KubeCon2026" rel="noopener noreferrer"&gt;Calico Open Source&lt;/a&gt;, is the most widely adopted container networking and security solution. Powering more than 100M containers across 8M+ nodes, Calico is supported across all major cloud providers and Kubernetes distributions.&lt;/p&gt;

&lt;p&gt;Media Contact&lt;br&gt;&lt;br&gt;
Media relations, Tigera&lt;br&gt;&lt;br&gt;
&lt;a href="mailto:contact@tigera.io"&gt;contact@tigera.io&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Next Steps: Get Hands-on with These Innovations
&lt;/h3&gt;

&lt;p&gt;Learn more about AI Assistant, Calico Load Balancer, and L2 networking support within the Calico ecosystem. Whether you are looking to optimize troubleshooting, reduce hardware dependency, or accelerate your VM migration, we provide the tools to get started today.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dp36elgeuxvuiact13r.png" alt="🚀" width="72" height="72"&gt; &lt;strong&gt;Experience the Platform:&lt;/strong&gt; &lt;a href="https://www.calicocloud.io/" rel="noopener noreferrer"&gt;Start a free trial of Calico Cloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m902pgqgnzrjghahs3o.png" alt="📅" width="72" height="72"&gt; &lt;strong&gt;Personalized Deep Dive:&lt;/strong&gt; &lt;a href="https://www.tigera.io/demo/" rel="noopener noreferrer"&gt;Request a technical demo with our engineering team&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Attending KubeCon Amsterdam? Stop by the Tigera booth #400 to learn more about these features.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/introducing-ai-assistant-for-calico-calico-load-balancer-and-seamless-vm-to-kubernetes-migration/" rel="noopener noreferrer"&gt;Introducing AI Assistant for Calico, Calico Load Balancer, and Seamless VM-to-Kubernetes Migration&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera - Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>companyblog</category>
    </item>
    <item>
      <title>Secure and Scale VMware VKS with Calico Kubernetes Networking</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Sun, 22 Mar 2026 18:50:39 +0000</pubDate>
      <link>https://dev.to/tigeraio/secure-and-scale-vmware-vks-with-calico-kubernetes-networking-4pl2</link>
      <guid>https://dev.to/tigeraio/secure-and-scale-vmware-vks-with-calico-kubernetes-networking-4pl2</guid>
      <description>&lt;p&gt;Co-authors&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abhishek Rao&lt;/strong&gt; | Tigera&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Ka Kit Wong, Charles Lee, &amp;amp; Christian Rauber&lt;/strong&gt; | Broadcom&lt;/p&gt;

&lt;p&gt;VMware vSphere Kubernetes Service (VKS) is the CNCF-certified Kubernetes runtime built directly into VMware Cloud Foundation (VCF), which delivers a single platform for both virtual machines and containers. VKS enables platform engineers to deploy, manage, and scale Kubernetes clusters while leveraging a comprehensive set of cloud services. And with VKS v3.6, that foundation just got significantly more powerful: VKS now natively supports Calico Enterprise — part of the &lt;a href="https://www.tigera.io/tigera-products/calico-commercial-editions/" rel="noopener noreferrer"&gt;Calico Unified Platform&lt;/a&gt; — as a validated, lifecycle-managed networking add-on through the new VKS Addon Framework.&lt;/p&gt;

&lt;p&gt;Even better, VKS natively integrates &lt;a href="https://www.tigera.io/tigera-products/calico/" rel="noopener noreferrer"&gt;Calico Open Source&lt;/a&gt; by Tigera as a supported, out-of-the-box Container Network Interface (CNI). This gives organizations a powerful open source baseline right from day one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable Data Planes:&lt;/strong&gt; The flexibility to run high-performance eBPF, standard Linux iptables, modern nftables, or Windows data planes based on specific workload needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wire-Speed Routing:&lt;/strong&gt; Direct BGP peering with the underlying VMware NSX infrastructure, eliminating the performance overhead of traditional overlay networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foundational Zero-Trust:&lt;/strong&gt; Global default-deny policies to instantly secure pod-to-pod traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Includes Whisker, a visual UI tool that simplifies access to flow logs, making it easier to analyze network communication and debug policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VKS and Calico Open Source build the perfect house for your applications. However, as Kubernetes adoption explodes across the enterprise, platform engineering and security teams inevitably hit a new wall.&lt;/p&gt;

&lt;p&gt;What happens when your security team mandates strict compliance audits across 50 different clusters? What happens when you need to route ephemeral Kubernetes traffic through your legacy physical firewalls? Or when a critical microservice drops traffic at 2 AM and you need to know exactly why?&lt;/p&gt;

&lt;p&gt;To conquer the complex realities of production scale, organizations running VKS are supercharging their environments with the &lt;a href="https://www.tigera.io/tigera-products/calico-commercial-editions/" rel="noopener noreferrer"&gt;Calico Unified Platform&lt;/a&gt; (available via Calico Enterprise and Calico Cloud). Here is how Calico transforms your baseline VKS clusters into a fully observable, enterprise-grade networking and security platform.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Calico Unified Platform Reference Architecture
&lt;/h3&gt;

&lt;p&gt;As you scale your VKS environment, your architecture must evolve from providing basic pod connectivity to delivering a comprehensive security, routing, and observability mesh.&lt;/p&gt;

&lt;p&gt;The reference architecture below illustrates how Calico Unified Platform wraps your VKS worker nodes in advanced Layer 7 protections, granular egress controls, and deep forensic logging capabilities—all while maintaining the high-performance eBPF and BGP foundation of your clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calico Unified Platform Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://wordpress-1075849-4005834.cloudwaysapps.com/app/uploads/2026/03/image1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1sqhhrz47wiuw3iovoo.png" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Calico Unified Platform reference architecture for VKS – showing how Calico Enterprise wraps VKS worker nodes with Layer 7 security, egress controls, and deep observability while preserving the eBPF and BGP performance foundation.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Secure the Perimeter: Bridging Kubernetes with Legacy Firewalls
&lt;/h3&gt;

&lt;p&gt;Traditional network security teams often struggle with Kubernetes because Pod IP addresses are ephemeral—they spin up and die in seconds. This makes it virtually impossible to write static firewall rules on your external Palo Alto or Fortinet appliances.&lt;/p&gt;

&lt;p&gt;The Calico Unified Platform bridges this gap seamlessly for VKS environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Egress Gateway &amp;amp; Source NAT:&lt;/strong&gt; Calico allows you to map dynamic Kubernetes namespaces to highly available, static IP Egress Gateways. When a pod talks to the outside world, your external firewall only sees the static IP. No more fighting with the NetSec team over IP tracking!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native WAF and IDS/IPS:&lt;/strong&gt; Secure your inbound traffic right at the Calico Ingress Gateway. Calico integrates a powerful Web Application Firewall (WAF) using the ModSecurity Core Rule Set. Coupled with native Intrusion Detection/Prevention (IDS/IPS) and DDoS protection, Calico detects and blocks malicious payloads before they impact performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS Policies &amp;amp; Threat Feeds:&lt;/strong&gt; Do not just block IPs; block malicious domains. Calico dynamically ingests global threat intelligence feeds to automatically halt traffic to known bad actors (a domain-based egress sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
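
&lt;p&gt;To ground the DNS policy idea, here is a rough sketch of an egress rule keyed on domain names rather than IP addresses. It assumes the DNS policy capability in the commercial Calico editions; the labels and domains are illustrative, and the exact schema should be checked against the documentation for your version.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative domain-based egress policy (commercial Calico DNS policy).
# Pods labelled app == "payments" may only reach the listed domains;
# any other egress from those pods hits the final Deny rule.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: payments-allowed-domains
spec:
  selector: app == "payments"
  types:
  - Egress
  egress:
  - action: Allow
    destination:
      domains:
      - "api.stripe.com"
      - "*.internal.example.com"
  - action: Deny
&lt;/code&gt;&lt;/pre&gt;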

&lt;h3&gt;
  
  
  2. Enforce Zero-Trust at Scale: Unified Policy Across Kubernetes, VMs, and Bare Metal
&lt;/h3&gt;

&lt;p&gt;Open-source network policies are fantastic, but managing them across dozens of teams and clusters can quickly turn into the “Wild West” of YAML files. Calico brings true enterprise governance to your VKS environment—and extends it well beyond Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network Policy Tiers &amp;amp; Staged Policies:&lt;/strong&gt; A hierarchical, RBAC-driven approach to security. The Security team can create non-overrideable “Tier 1” guardrails, while Developers get full freedom to write microsegmentation rules for their specific namespaces. Even better, with Staged Policies, you can preview and test the impact of any rule on live traffic before fully enforcing it, ensuring zero downtime (a short tier and staged-policy sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Protection for Legacy VMs &amp;amp; Bare Metal:&lt;/strong&gt; Your VKS clusters do not exist in a vacuum. Calico extends its policy engine beyond Kubernetes, allowing you to secure traditional VMware VMs and bare-metal servers using the exact same single-pane-of-glass dashboard—a headline differentiator of the Calico Unified Platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sidecar-Less Service Mesh (Istio Ambient Mode):&lt;/strong&gt; Get the deep L7 visibility and mTLS encryption of a service mesh without the crippling performance overhead. Calico seamlessly integrates with Istio Ambient Mesh, managed through a single Calico operator—no standalone Istio expertise required.&lt;/li&gt;
&lt;/ul&gt;
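
&lt;p&gt;As a rough illustration of tiers and staged policies, the sketch below defines a high-priority tier owned by the security team and stages (rather than enforces) a guardrail policy inside it. The tier name, order value, and label selectors are illustrative, and staged policy resources are a commercial Calico capability, so verify the exact resource kinds against your release before relying on this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative tier; lower order values are evaluated earlier.
apiVersion: projectcalico.org/v3
kind: Tier
metadata:
  name: security
spec:
  order: 100
---
# Staged guardrail policy: reports what it would deny in flow logs
# without enforcing it, so the impact can be reviewed before promotion.
apiVersion: projectcalico.org/v3
kind: StagedGlobalNetworkPolicy
metadata:
  name: security.deny-non-prod-to-prod   # tiered policies are named &lt;tier&gt;.&lt;name&gt;
spec:
  tier: security
  selector: env == "prod"
  types:
  - Ingress
  ingress:
  - action: Deny
    source:
      selector: env == "non-prod"
&lt;/code&gt;&lt;/pre&gt;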

&lt;h3&gt;
  
  
  3. Total Visibility: One Management Plane for Every Traffic Flow
&lt;/h3&gt;

&lt;p&gt;When a connection fails in a standard K8s cluster, troubleshooting usually involves blindly digging through kubectl logs. It is slow, frustrating, and drastically inflates your Mean Time to Resolution (MTTR).&lt;/p&gt;

&lt;p&gt;Calico acts as the ultimate CCTV system for your VKS clusters—with a single console covering every traffic type, from ingress to egress to pod-to-pod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Service Graph &amp;amp; Alerts:&lt;/strong&gt; Get a real-time visual map of all microservice traffic across your clusters. Instantly see performance metrics, blocked traffic, and active connections. You can even configure automated alerts and incident response to deploy mitigating policies the second an anomaly is detected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Forensic Logging:&lt;/strong&gt; Calico goes far beyond basic flow logs. It provides granular DNS Logs, L7 Logs, and Ingress Logs, allowing you to pinpoint exactly which layer of the stack is failing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-Demand Packet Capture:&lt;/strong&gt; Did a specific pod trigger an anomaly? Trigger a targeted packet capture (pcap) directly from the Calico UI for deep forensic analysis, without ever having to SSH into the vSphere worker nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Scale Without Limits: Multi-Cluster Management and AI-Powered Operations
&lt;/h3&gt;

&lt;p&gt;As your VMware footprint grows, managing clusters individually becomes impossible. Calico’s Multi-Cluster Management provides a single pane of glass to view, secure, and troubleshoot all your VKS clusters—and even your public cloud EKS/AKS clusters. You can seamlessly federate identities and extend resilient multi-cluster networking with Cluster Mesh.&lt;/p&gt;

&lt;p&gt;And when things get truly complex? AI Assistant for Calico serves as your platform co-pilot. You can use natural language prompts to generate declarative Policy as Code, query flow logs, and diagnose active threats, drastically reducing the learning curve for new team members.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Ultimate VKS Experience
&lt;/h3&gt;

&lt;p&gt;VMware VKS gives you a world-class, CNCF-certified Kubernetes platform built directly into VCF. Calico Enterprise — part of the &lt;a href="https://www.tigera.io/tigera-products/calico-commercial-editions/" rel="noopener noreferrer"&gt;Calico Unified Platform&lt;/a&gt; — takes that foundation further, delivering a single management plane for networking, network security, and observability across every cluster, every workload type, and every environment. No stitching tools together. No integration tax. Just the enterprise-grade performance and security your most critical workloads demand.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ready to see it in action?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/demo/" rel="noopener noreferrer"&gt;Request a Demo of Calico Enterprise →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.calicocloud.io/home" rel="noopener noreferrer"&gt;Start your free trial of Calico Cloud today →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/vmware-vks-calico-secure-networking/" rel="noopener noreferrer"&gt;Secure and Scale VMware VKS with Calico Kubernetes Networking&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera - Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>companyblog</category>
      <category>technicalblog</category>
      <category>partnerintegration</category>
      <category>announcements</category>
    </item>
    <item>
      <title>Calico Load Balancer: Simplifying Network Traffic Management with eBPF</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Sat, 21 Mar 2026 20:00:55 +0000</pubDate>
      <link>https://dev.to/tigeraio/calico-load-balancer-simplifying-network-traffic-management-with-ebpf-3l21</link>
      <guid>https://dev.to/tigeraio/calico-load-balancer-simplifying-network-traffic-management-with-ebpf-3l21</guid>
      <description>&lt;p&gt;&lt;strong&gt;Authors:&lt;/strong&gt; Alex O’Regan, Aadhil Abdul Majeed&lt;/p&gt;

&lt;p&gt;Ever had a load balancer become the bottleneck in an on-prem Kubernetes cluster? You are not alone. Traditional hardware load balancers add cost, create coordination overhead, and can make scaling painful. A Kubernetes-native approach can overcome many of those challenges by pushing load balancing into the cluster data plane. Calico Load Balancer is an &lt;a href="https://www.tigera.io/learn/guides/ebpf/" rel="noopener noreferrer"&gt;&lt;strong&gt;eBPF&lt;/strong&gt;&lt;/a&gt;-powered, Kubernetes-native load balancer that uses consistent hashing (Maglev) and Direct Server Return (DSR) to keep sessions stable while allowing you to scale on demand.&lt;/p&gt;

&lt;p&gt;Below is a developer-focused walkthrough: what problem Calico Load Balancer solves, how Maglev consistent hashing works, the life of a packet with DSR, and a clear configuration workflow you can follow to roll it out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why a Kubernetes-native load balancer matters
&lt;/h2&gt;

&lt;p&gt;On-prem clusters often rely on dedicated hardware or proprietary appliances to expose services. That comes with a few persistent problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost and scaling friction&lt;/strong&gt; – You have to scale the network load balancer vertically as the size and throughput requirements of your Kubernetes clusters grow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational overhead&lt;/strong&gt; – Virtual IPs (VIPs) are often owned by another team, so simple service changes require coordination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful failure modes&lt;/strong&gt; – Kube-proxy load balancing is stateful per node, so losing an ingress node can break active sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration drift&lt;/strong&gt; – Kubernetes is declarative, but the upstream load balancer is not, which causes divergence over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calico Load Balancer flips that model. Instead of dedicated hardware, it uses the &lt;strong&gt;Calico eBPF&lt;/strong&gt; data plane on ordinary Linux nodes in the cluster, advertises service IPs via &lt;a href="https://www.tigera.io/blog/when-to-use-bgp-vxlan-or-ip-in-ip-a-practical-guide-for-kubernetes-networking/" rel="noopener noreferrer"&gt;BGP&lt;/a&gt;, and makes the load balancing decision consistent across nodes. The result is a system that is cheaper to scale, easier to operate, and more resilient to node or path changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Calico Load Balancer works (and why Maglev matters)
&lt;/h2&gt;

&lt;p&gt;The core idea is consistent hashing. Instead of each node picking a backend at random and storing that decision in per-node state, Calico Load Balancer computes the same backend choice on any node for the same flow. This is implemented with Maglev, a consistent hashing algorithm that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evenly distributes connections across backends.&lt;/li&gt;
&lt;li&gt;Minimizes disruption when load balancer nodes come and go.&lt;/li&gt;
&lt;li&gt;Allows any load balancer node to make the same backend selection, even mid-connection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kube-proxy uses random selection plus per-node state, which is fine for many cases but can fail under node churn or route changes. Maglev avoids that by making the decision deterministic. Nodes may still cache the mapping for performance, but the flow-to-backend decision can be reproduced anywhere, which is what keeps sessions stable when traffic lands on a different node.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategic Assessment: Is This Right for Your Deployment?
&lt;/h3&gt;

&lt;p&gt;Questions you can ask your team to identify if Calico Load Balancer can help your environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which services are most impacted by node churn today?&lt;/li&gt;
&lt;li&gt;Where do we see the most operational overhead in Virtual IP (VIP) provisioning?&lt;/li&gt;
&lt;li&gt;How do we secure access to service VIPs?&lt;/li&gt;
&lt;li&gt;Does the network have Equal Cost Multi-Path (ECMP) access to service VIPs?&lt;/li&gt;
&lt;li&gt;How do we handle VIP failover?&lt;/li&gt;
&lt;li&gt;Are there services with high-throughput requirements?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Life of a Packet
&lt;/h2&gt;

&lt;p&gt;A key design goal is to keep client sessions stable while enabling horizontal scale. Here is a simplified flow for a typical ECMP + BGP setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/app/uploads/2026/03/image2-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwafci6ul41u8twkj1qzm.png" alt="This diagram shows how Direct Server Return (DSR) allows the return path to bypass the load balancer node, reducing latency and hop count." width="800" height="580"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;This diagram shows how Direct Server Return (DSR) allows the return path to bypass the load balancer node, reducing latency and hop count.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A few important details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The top-of-rack router uses ECMP to pick a load balancer node to receive the packet.&lt;/li&gt;
&lt;li&gt;That node runs the Maglev algorithm to choose the backend pod. It DNATs the packet and tunnels it to the node that hosts the pod.&lt;/li&gt;
&lt;li&gt;The pod replies, and the node SNATs the packet back to the service VIP before it leaves.&lt;/li&gt;
&lt;li&gt;With &lt;strong&gt;DSR (Direct Server Return)&lt;/strong&gt;, the return path bypasses the load balancer node and goes straight back to the client. The client always sees responses from the advertised service VIP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That &lt;strong&gt;DSR&lt;/strong&gt; path is important. It keeps the data path efficient and reduces load balancer hop count on the return path. It also prevents the client from seeing internal pod IPs.&lt;/p&gt;

&lt;h3&gt;
  
  
  DSR compared to a traditional return path
&lt;/h3&gt;

&lt;p&gt;If you have only worked with classic NAT-based load balancers, DSR can feel unusual. The key difference is that the response does not have to traverse the same load balancer node that handled the inbound packet. That has two practical benefits: less work for the load balancer nodes and lower return-path latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maglev and caching: deterministic and fast
&lt;/h3&gt;

&lt;p&gt;There are two pieces working together in Calico Load Balancer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Maglev lookup table:&lt;/strong&gt; Provides the deterministic backend choice. Any node can compute the same result for the same flow, which is why mid-connection packets can land on a different node without breaking the session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A per-flow cache:&lt;/strong&gt; A cache (for example, conntrack) can retain that decision for efficiency and preserve existing connections when the backend lookup table changes, but it is not the source of truth for correctness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a subtle but important difference from kube-proxy. In kube-proxy, the per-node conntrack decision is the only thing tying a flow to a backend. In Calico Load Balancer, which uses &lt;a href="https://www.tigera.io/learn/guides/ebpf/" rel="noopener noreferrer"&gt;&lt;strong&gt;Calico’s eBPF dataplane&lt;/strong&gt;&lt;/a&gt;, the decision can be reproduced on any node, which is what makes failover or ECMP rehash events non-disruptive.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens during failures or path changes
&lt;/h3&gt;

&lt;p&gt;Consistent hashing is not just about distribution. It is about resilience. In practice, you can test this by intentionally re-routing traffic for an existing TCP connection to a different node. Even if the new node has no prior per-flow state, it can recompute the same backend decision using Maglev, so the connection can continue without disruption.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/app/uploads/2026/03/image1-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z49ruqn8qpbk9kils6w.png" width="800" height="546"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Calico uses Maglev consistent hashing to ensure TCP sessions remain stable even if a load balancer node fails or is drained&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This matters when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A load balancer node fails or is drained.&lt;/li&gt;
&lt;li&gt;ECMP next hops reshuffle due to network outages.&lt;/li&gt;
&lt;li&gt;You scale the load balancer pool up or down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the decision is deterministic, the packet can land on any node and still find the correct backend. The whole cluster then effectively acts as a single, distributed load balancer, with per-node caches for additional performance and resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration workflow (high level)
&lt;/h2&gt;

&lt;p&gt;Calico Load Balancer is configured and managed declaratively just like any other Kubernetes resource. A typical configuration flow looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a dedicated IP pool for Calico LB IPAM, marked for LoadBalancer use.&lt;/li&gt;
&lt;li&gt;Create a Service of type LoadBalancer. Calico IPAM allocates a VIP from that pool.&lt;/li&gt;
&lt;li&gt;Advertise the VIP to the upstream network using Calico BGP (optional BFD for faster detection of outages).&lt;/li&gt;
&lt;li&gt;Ensure your upstream router uses ECMP to send traffic for the VIP to the Calico load balancer nodes.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calico IP pool for load balancer VIPs&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projectcalico.org/v3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IPPool&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loadbalancer-ip-pool&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cidr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;192.210.0.0/20&lt;/span&gt;
  &lt;span class="na"&gt;blockSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;24&lt;/span&gt;
  &lt;span class="na"&gt;assignmentMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Automatic&lt;/span&gt;
  &lt;span class="na"&gt;allowedUses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;


&lt;span class="c1"&gt;# Kubernetes Service using Calico LB&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;lb.projectcalico.org/external-traffic-strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maglev&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8443&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there, the VIP is advertised and traffic can arrive through the ECMP paths to any load balancer node. Calico handles the rest.&lt;/p&gt;
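
&lt;p&gt;For the advertisement step specifically, one way to tell Calico which LoadBalancer CIDRs to announce over BGP is the BGPConfiguration resource. The sketch below reuses the CIDR from the IP pool above; the BGP peers themselves (BGPPeer resources) and any BFD tuning are environment-specific and left out here.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Advertise the LoadBalancer VIP range to upstream BGP peers.
# Peers are configured separately via BGPPeer resources.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceLoadBalancerIPs:
  - cidr: 192.210.0.0/20   # matches the loadbalancer-ip-pool above
&lt;/code&gt;&lt;/pre&gt;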

&lt;h2&gt;
  
  
  Platform Benefits
&lt;/h2&gt;

&lt;p&gt;These design choices translate into real operational advantages for platform teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Remove Hardware Dependency:&lt;/strong&gt; Scale load balancing capacity by adding standard Kubernetes nodes rather than purchasing expensive appliances or coordinating with vendors and avoid vendor lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes-native approach:&lt;/strong&gt; Reduces complexity by keeping all service configuration within your existing GitOps workflows – no separate load balancer management interfaces or external ticketing systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session persistence:&lt;/strong&gt; Addresses one of the most common causes of user-facing outages in traditional setups, where losing an ingress node would drop all active connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-service capability:&lt;/strong&gt; Empowers development teams to provision and modify load balancer configurations without waiting for network team approvals, significantly reducing time-to-market for new services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictable traffic distribution:&lt;/strong&gt; Maglev’s consistent hashing ensures that traffic distribution remains predictable and fair even as backend pods scale up and down, preventing the “hot spot” issues that can occur with simpler load balancing algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Calico Load Balancer gives you a Kubernetes-native way to scale your load balancer and protect critical services without the operational drag of traditional appliances.&lt;/p&gt;




&lt;h3&gt;
  
  
  Ready to scale your on-prem networking?
&lt;/h3&gt;

&lt;p&gt;If you want to try this in your environment, here is a safe, incremental path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify&lt;/strong&gt; a non-critical service that is a good LoadBalancer candidate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create&lt;/strong&gt; a Calico IP pool for LoadBalancer VIPs and advertise it via BGP to your upstream network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable&lt;/strong&gt; a LoadBalancer Service with Maglev for that service and confirm the VIP is reachable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; failover: remove a load balancer node or change ECMP next hops and verify sessions continue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document&lt;/strong&gt; the workflow and replicate to other services.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/learn/guides/ebpf/" rel="noopener noreferrer"&gt;Learn more about Calico eBPF&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/calico-load-balancer-simplifying-network-traffic-management-with-ebpf/" rel="noopener noreferrer"&gt;Calico Load Balancer: Simplifying Network Traffic Management with eBPF&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera - Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalblog</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>Lift-and-Shift VMs to Kubernetes with Calico L2 Bridge Networks</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Sat, 21 Mar 2026 06:12:02 +0000</pubDate>
      <link>https://dev.to/tigeraio/lift-and-shift-vms-to-kubernetes-with-calico-l2-bridge-networks-2d15</link>
      <guid>https://dev.to/tigeraio/lift-and-shift-vms-to-kubernetes-with-calico-l2-bridge-networks-2d15</guid>
      <description>&lt;p&gt;On paper, lift-and-shift VM migration to Kubernetes sounds simple. Compute can be moved. Storage can be remapped. But many migration projects stall at the network boundary. VM workloads are often tied to IP addresses, network segments, firewall rules, and routing models that already exist in the wider environment. That is where lift-and-shift becomes much harder than it first appears.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why lift-and-shift migration is challenging
&lt;/h2&gt;

&lt;p&gt;In a traditional hypervisor environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A VM connects to a network the rest of the data center already understands.&lt;/li&gt;
&lt;li&gt;Its IP address is a first-class citizen of the network.&lt;/li&gt;
&lt;li&gt;Firewalls, routers, &lt;a href="https://www.tigera.io/learn/guides/kubernetes-monitoring/kubernetes-monitoring-tools/" rel="noopener noreferrer"&gt;monitoring tools&lt;/a&gt;, and peer applications know how to reach it.&lt;/li&gt;
&lt;li&gt;Existing application dependencies are often built around that network identity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Default &lt;a href="https://www.tigera.io/learn/guides/kubernetes-networking/" rel="noopener noreferrer"&gt;Kubernetes pod networking&lt;/a&gt; works very differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod IPs usually come from a cluster-managed pod CIDR.&lt;/li&gt;
&lt;li&gt;Those IPs are mainly meaningful inside the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;The upstream network usually does not have direct visibility into pod networks.&lt;/li&gt;
&lt;li&gt;The original network segments from the VM world are not preserved by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a major problem for VM migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The workload can no longer keep the same network presence it had before.&lt;/li&gt;
&lt;li&gt;Teams often need to introduce VIPs or reconfigure the networking settings of the VM.&lt;/li&gt;
&lt;li&gt;That adds more complexity since changing the IP of the VM also requires changes to network firewall and load balancer configuration.&lt;/li&gt;
&lt;li&gt;At scale, it can make migration slower, more expensive, and harder to justify.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So while Kubernetes can be a strong platform for running VM workloads, default pod networking is often not a natural fit for lift-and-shift migration. The networking gap is one of the biggest reasons these projects become more complex than expected.&lt;/p&gt;

&lt;p&gt;The lack of network continuity is shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/app/uploads/2026/03/image1-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmam2sb943kmnvry1wwri.png" alt="A diagram showing a VM moving from an existing hypervisor to a Kubernetes Pod Network, resulting in " width="800" height="591"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Default pod networking often creates a gap in network continuity, forcing complex reconfigurations and breaking existing dependencies like firewalls and load balancers.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Calico L2 Bridge Networks
&lt;/h2&gt;

&lt;p&gt;Calico L2 Bridge Networks are designed to close that gap. Instead of forcing the VM to adapt to the Kubernetes pod network, Calico allows administrators to extend the existing layer 2 network all the way to the virtual machine running in Kubernetes.&lt;/p&gt;

&lt;p&gt;Administrators can define a &lt;strong&gt;network&lt;/strong&gt; resource in Kubernetes, and Calico creates a bridge on the cluster nodes to extend external networks. A trunk interface can be attached to the bridge, allowing VLANs to be carried all the way to the virtual machine. During migration, the migration tool can map the VM’s existing interface to interface definitions in the cluster and also inform Calico of the VM’s IP address, so Calico can keep track of that address throughout the VM’s lifecycle. Calico does all the underlying plumbing to ensure that the VM retains its network connectivity after migration.&lt;/p&gt;
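
&lt;p&gt;The exact resource schema depends on your Calico version, but conceptually the definition ties a node-level bridge to a trunk interface and the VLAN it should carry. The sketch below is purely hypothetical: the kind and field names are illustrative placeholders rather than the documented API, and it is only meant to show the shape of the intent (bind a VLAN on an existing trunk to a named network that a migrated VM can attach to).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# HYPOTHETICAL sketch: kind and field names are illustrative, not the
# actual Calico API. Consult the Calico docs for the real L2 bridge
# network resource schema.
apiVersion: projectcalico.org/v3
kind: Network                # hypothetical kind
metadata:
  name: legacy-app-vlan-120
spec:
  type: L2Bridge             # hypothetical fields below
  bridge:
    trunkInterface: ens224   # physical trunk NIC on each node (example)
    vlan: 120                # VLAN carried through to the VM
&lt;/code&gt;&lt;/pre&gt;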

&lt;p&gt;The key point is that the VM does not need a brand new networking model just because it moved to Kubernetes. The same layer 2 network structure can be preserved, which makes lift-and-shift migration much more practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Existing VLAN-based connectivity can be extended directly to the VM.&lt;/li&gt;
&lt;li&gt;Administrators do not need to re-address the VM or place it behind VIPs just to make migration work.&lt;/li&gt;
&lt;li&gt;Multiple VLANs can be supported through the same trunk-backed bridge.&lt;/li&gt;
&lt;li&gt;The network can move with the VM, instead of becoming a separate redesign project.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The network continuity offered by Calico L2 Bridge Networks is shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/app/uploads/2026/03/image3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdc1bak72t2fi379y72a.png" alt="A diagram showing a VM migrating to Kubernetes via a Calico L2 Bridge, which extends existing VLANs and maintains connection to original network firewalls and load balancers." width="800" height="592"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Calico L2 Bridge Networks allow you to extend existing Layer 2 infrastructure directly into Kubernetes, enabling “lift-and-shift” migrations that preserve original IP addresses and VLANs.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Readiness Assessment: Is L2 Bridge Networking Right for Your Migration?
&lt;/h4&gt;

&lt;p&gt;Ask your infrastructure and networking teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do our existing VMs rely on specific VLAN tags for firewall policy enforcement?&lt;/li&gt;
&lt;li&gt;Will re-addressing our workloads require updating multiple external load balancers or hardcoded application dependencies?&lt;/li&gt;
&lt;li&gt;Do we need to maintain L2 adjacency between our legacy VM clusters and new Kubernetes nodes during a phased migration?&lt;/li&gt;
&lt;li&gt;Is network observability (via eBPF) a requirement for our compliance or troubleshooting workflows post-migration?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benefits After Migration
&lt;/h2&gt;

&lt;p&gt;Calico L2 Bridge Networks do more than simplify the move into Kubernetes. Once the VM is running in Kubernetes, Calico can also bring the same operational advantages that teams already expect for cloud-native workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network Observability
&lt;/h3&gt;

&lt;p&gt;One major benefit is &lt;a href="https://www.tigera.io/learn/guides/observability/" rel="noopener noreferrer"&gt;observability&lt;/a&gt;. Calico provides visibility into network traffic for these VM interfaces, giving administrators a much clearer view of how workloads are communicating after migration. Because Calico uses eBPF, it can capture deep insights into network behavior without relying on external tooling or guesswork. That makes it easier to understand traffic patterns, troubleshoot issues, and operate migrated VMs with more confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Calico Policy Enforcement
&lt;/h3&gt;

&lt;p&gt;Another major benefit is policy enforcement. Administrators can apply declarative &lt;a href="https://www.tigera.io/learn/guides/kubernetes-security/kubernetes-network-policy/" rel="noopener noreferrer"&gt;network policy&lt;/a&gt; directly to these VM interfaces using Kubernetes-native constructs. Policies can be based on labels, which fits naturally into Kubernetes operations, and selectors can be used to target specific VLANs or external networks when defining policy. Teams can also migrate networking policy from their previous hypervisor environment into Calico network policy, helping them maintain the same security posture as workloads move into Kubernetes. In practice, that means teams can preserve the connectivity model they need while still applying consistent, modern security controls to VM workloads inside Kubernetes.&lt;/p&gt;
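
&lt;p&gt;Because the migrated VM is just another labelled workload endpoint from Calico’s point of view, ordinary Calico policy syntax applies to it. The snippet below is a minimal sketch under that assumption; the namespace, labels, and port are illustrative placeholders that would come from your own migration tooling and application.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative policy for a migrated database VM: only the app tier may
# reach it on its listening port. Namespace, labels, and port are examples.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-app-to-db-vm
  namespace: legacy-workloads
spec:
  selector: workload == "db-vm"   # label applied to the migrated VM
  types:
  - Ingress
  ingress:
  - action: Allow
    protocol: TCP
    source:
      selector: tier == "app"
    destination:
      ports:
      - 5432
&lt;/code&gt;&lt;/pre&gt;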

&lt;h3&gt;
  
  
  Live Migration
&lt;/h3&gt;

&lt;p&gt;Live migration is another important benefit. Once the VM is running in Kubernetes, it can be moved from one node to another while retaining the same network configuration. That is critical for day-2 operations, because it means teams can take advantage of Kubernetes-based VM mobility without having to rework network settings each time a workload moves. The network identity stays consistent even as the VM is migrated across the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/app/uploads/2026/03/image2-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmftw2561i69ipnocecq5.png" alt="A diagram illustrating a VM live migrating from Node 1 to Node 2 within a Kubernetes cluster while maintaining consistent compute and networking via KubeVirt and Calico." width="800" height="598"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;By decoupling compute and networking, Calico ensures that migrated VMs can move between cluster nodes while retaining their original network configuration and identity.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Lift-and-shift VM migration to Kubernetes often breaks down because the network model does not move with the workload. That forces teams to introduce workarounds such as VIPs, re-addressing, and additional operational complexity, which can quickly turn a simple migration plan into a much larger project.&lt;/p&gt;

&lt;p&gt;Calico L2 Bridge Networks help remove that barrier by extending existing layer 2 networks all the way to the VM inside Kubernetes. That means teams can preserve familiar network configurations during migration while also gaining the advantages of running VMs on Kubernetes, including observability, declarative policy, and live migration. Instead of treating networking as a migration blocker, organizations can use Calico to make it part of a cleaner and more practical path forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webinar recording (available on demand)&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Calico L2 bridge networking for virtual machines
&lt;/h2&gt;

&lt;p&gt;Migrating VMs to Kubernetes? Learn how to preserve your existing IPs, VLANs, and security policies — no network rebuild required.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Lift and shift” VM migrations with zero IP changes&lt;/li&gt;
&lt;li&gt;Maintain existing VLANs and security dependencies&lt;/li&gt;
&lt;li&gt;Expert guidance from Tigera’s networking team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=gxpm47mGKPc" rel="noopener noreferrer"&gt;Watch the recording&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/lift-and-shift-vms-to-kubernetes-with-calico-l2-bridge-networks/" rel="noopener noreferrer"&gt;Lift-and-Shift VMs to Kubernetes with Calico L2 Bridge Networks&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera - Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalblog</category>
      <category>bestpractices</category>
    </item>
    <item>
      <title>AI Assistant for Calico: Troubleshooting at the Speed of Thought</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Thu, 19 Mar 2026 20:36:45 +0000</pubDate>
      <link>https://dev.to/tigeraio/ai-assistant-for-calico-troubleshooting-at-the-speed-of-thought-38jo</link>
      <guid>https://dev.to/tigeraio/ai-assistant-for-calico-troubleshooting-at-the-speed-of-thought-38jo</guid>
      <description>&lt;p&gt;Despite the wealth of data available, distilling a coherent narrative from a Kubernetes cluster remains a challenge for modern infrastructure teams. Even with powerful visualization tools like the Policy Board, Service Graph, and specialized dashboards, &lt;a href="https://www.splunk.com/en_us/blog/learn/kubernetes-troubleshoot-observability.html" rel="noopener noreferrer"&gt;users often find themselves spending significant time piecing together context across different screens&lt;/a&gt;. Making good use of this data to secure a cluster or troubleshoot an issue becomes nearly impossible when it requires manually searching across multiple sources to find a single “connecting thread.”&lt;/p&gt;

&lt;p&gt;Inevitably, security holes happen, configurations conflict causing outages, and teams scramble to find that needle-in-the-haystack cause of cluster instability. A new approach is needed to understand the complex layers of security and the interconnected relationships among numerous microservices. Observability tools need to not only organize and present data in a coherent manner but proactively help to filter and interpret it, cutting through the noise to get to the heart of an issue. As we discussed in our &lt;a href="https://www.tigera.io/blog/2026-the-rise-of-ai-agents/" rel="noopener noreferrer"&gt;2026 outlook on the rise of AI agents&lt;/a&gt;, this represents a fundamental shift in Kubernetes management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; With AI Assistant for Calico, observability takes a leap forward, providing a proactive, conversational, and context-aware intelligence layer to extract actionable insights from a sea of raw telemetry. SREs can interrogate their data through a natural language interface instead of having to painstakingly construct complex queries, removing knowledge barriers and reducing MTTR (Mean Time to Repair).&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Manual Log Analysis
&lt;/h2&gt;

&lt;p&gt;To understand the impact of the AI Assistant for Calico, it is helpful to look at the traditional workflow through the lens of the challenges platform teams face daily. Troubleshooting connectivity issues, for example, typically starts with a look at traffic flows, identifying ones that may be problematic, then drilling down into the details while looking up possibly relevant policies, network configuration, ingress rules, and hostname resolution in different dashboards and sets of logs. Often one or more multi-step queries have to be run and then the results have to be filtered to start getting an idea of what may be going wrong. This is particularly difficult when &lt;a href="https://www.tigera.io/blog/why-kubernetes-flat-networks-fail-at-scale/" rel="noopener noreferrer"&gt;Kubernetes flat networks fail at scale&lt;/a&gt;, increasing the complexity of every query.&lt;/p&gt;

&lt;p&gt;This sort of manual navigation slows down problem resolution and imposes a high cognitive cost on SREs. Even for seasoned engineers, debugging can take hours or even days when the answer must be excavated from multiple sources of information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Natural Language Insights
&lt;/h2&gt;

&lt;p&gt;The AI Assistant for Calico resolves these bottlenecks by replacing cumbersome queries with a seamless, natural-language interface that interprets telemetry instead of just displaying it and synthesizes data from multiple sources so you don’t have to. By moving away from rigid query languages, the assistant changes how engineers interact with their cluster data in three primary ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ask, Don’t Query:&lt;/strong&gt; Troubleshooting now starts with an articulation of intent instead of a lengthy session wrestling with search fields and operators. Being able to simply ask “What are the unrestricted egress destinations currently receiving traffic from my pods?” without painstakingly cobbling together and testing a multi-layered query is a paradigm shift. It moves the engineer’s focus from the mechanics of the search to the logic of the solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-Aware Explanations:&lt;/strong&gt; The assistant doesn’t just return raw data; it provides summaries and recommendations generated from real telemetry and policy context. It can explain, for instance, that “Traffic is denied because policy X in namespace Y blocks TCP 443.” It also suggests further troubleshooting steps and offers remediation advice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Visibility Across the Cluster:&lt;/strong&gt; The assistant provides insights across clusters, namespaces, and workloads, extracting details that would previously require drilling down into, for example, a specific flow or policy configuration. All of a sudden, that “connecting thread” between seemingly isolated events becomes a lot clearer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI Assistant for Calico allows engineers to quickly zero in on relevant information using a conversational form of root-cause analysis that even junior members of the team can have success with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/app/uploads/2026/03/AI-Assisstant-for-Calico-.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2glgplvwnukldlj56qh2.png" width="800" height="477"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI Assistant for Calico can quickly get you the information you need&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Proactive Security and Policy Optimization
&lt;/h2&gt;

&lt;p&gt;While reactive troubleshooting is critical, the AI Assistant for Calico also enables a proactive security posture by identifying misconfigurations and security gaps that might otherwise go unnoticed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surfacing Exposure Risks:&lt;/strong&gt; The AI Assistant can identify workloads exposed to the internet or detect egress exposure risks, such as pods communicating with unrestricted external destinations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy Recommendations and Generation:&lt;/strong&gt; Instead of starting from scratch, users can ask the AI to recommend a base policy or generate a specific snippet, such as a policy to block all egress traffic from a specific training pod (a sketch of such a snippet follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleaning up the Mesh:&lt;/strong&gt; The assistant helps maintain cluster stability and security hygiene by detecting unused or missing network policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identifying Gaps:&lt;/strong&gt; It proactively surfaces network flows that have no policies applied to them, ensuring that the principle of least privilege is maintained across the cluster—a key requirement highlighted in the &lt;a href="https://www.tigera.io/blog/key-insights-from-the-2025-gigaom-radar-for-container-networking/" rel="noopener noreferrer"&gt;2025 GigaOm Radar for Container Networking&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
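
&lt;p&gt;To ground the policy-generation example, the kind of snippet the assistant might produce for “block all egress traffic from a specific training pod” would look roughly like the sketch below. The namespace and label are illustrative, and any generated policy should still be reviewed (or staged) before it is enforced.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative egress lockdown for a single training workload.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: deny-egress-training-pod
  namespace: ml-training        # illustrative namespace
spec:
  selector: job == "llm-train"  # illustrative pod label
  types:
  - Egress
  egress:
  - action: Deny
&lt;/code&gt;&lt;/pre&gt;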

&lt;p&gt;These capabilities streamline the time-consuming and error-prone process of manually managing intricate policy syntax, making for more stable, performant, and secure clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Scenario: Rapidly Resolving a Blocked Service Connection
&lt;/h2&gt;

&lt;p&gt;To see the impact of these capabilities, consider a common high-pressure situation for a platform engineer. An engineer receives an urgent alert that a critical production service is unable to communicate with its database.&lt;/p&gt;

&lt;p&gt;In a traditional environment, the engineer would spend 30 to 60 minutes manually checking network policies, inspecting flow logs, and verifying namespace labels across multiple clusters to find the culprit. Every minute of manual investigation increases the risk of service downtime and customer frustration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI Solution:&lt;/strong&gt; Instead of manual log diving, the engineer asks the AI Assistant for Calico a direct question: “Why is the frontend-service in the production namespace unable to reach the db-service?” The AI instantly analyzes the environment and identifies that a recent policy update is missing a necessary egress rule for the specific database port. Total resolution time is reduced from over an hour to just a few minutes.&lt;/p&gt;

&lt;p&gt;Thinking ahead, the engineer asks for an audit of all staged policies. AI Assistant for Calico finds another incorrect policy—this one with a misspelled label selector—averting a future outage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.arcade.software/share/vnmgt3EfCjxX76D26z48" rel="noopener noreferrer"&gt;View Interactive Demo: Exploring Assistant for Calico →&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A New Standard for Platform Operations
&lt;/h2&gt;

&lt;p&gt;The introduction of the AI Assistant for Calico in the &lt;a href="https://www.tigera.io/blog/whats-new-in-calico-winter-2026-release/" rel="noopener noreferrer"&gt;Winter 2026 release&lt;/a&gt; is the next step in observability and Kubernetes management. By adding the ability to interrogate a cluster in plain English, Calico’s unified platform bridges the gap between high-fidelity telemetry data and practical solutions.&lt;/p&gt;

&lt;p&gt;Beyond the immediate operational gains, this AI-powered approach fits into a broader strategy of defense in depth and operational simplicity, specifically regarding &lt;a href="https://www.tigera.io/blog/ingress-security-for-ai-workloads/" rel="noopener noreferrer"&gt;ingress security for AI workloads&lt;/a&gt;. It removes the friction of complex debugging, accelerates onboarding for new team members, and ensures that your security posture remains consistent even as your architecture scales.&lt;/p&gt;




&lt;h3&gt;
  
  
  Experience the Power of AI Assistant for Calico
&lt;/h3&gt;

&lt;p&gt;Ready to see how AI can accelerate your Kubernetes troubleshooting and network policy management?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/event/calico-ai-accelerating-kubernetes-troubleshooting-and-network-policy-management/" rel="noopener noreferrer"&gt;Watch the On-Demand Demo&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.calicocloud.io/home" rel="noopener noreferrer"&gt;Sign Up for Calico Cloud (Free Trial)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/ai-assistant-for-calico-troubleshooting-at-the-speed-of-thought/" rel="noopener noreferrer"&gt;AI Assistant for Calico: Troubleshooting at the Speed of Thought&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera - Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalblog</category>
      <category>bestpractices</category>
      <category>howto</category>
    </item>
    <item>
      <title>What Your EKS Flow Logs Aren’t Telling You</title>
      <dc:creator>Alister Baroi</dc:creator>
      <pubDate>Wed, 18 Mar 2026 21:06:48 +0000</pubDate>
      <link>https://dev.to/tigeraio/what-your-eks-flow-logs-arent-telling-you-50ca</link>
      <guid>https://dev.to/tigeraio/what-your-eks-flow-logs-arent-telling-you-50ca</guid>
      <description>&lt;p&gt;If you’re running workloads on Amazon EKS, there’s a good chance you already have some form of network observability in place. VPC Flow Logs have been a staple of AWS networking for years, and AWS has since introduced Container Network Observability, a newer set of capabilities built on Amazon CloudWatch Network Flow Monitor, which adds pod-level visibility and a service map directly in the EKS console.&lt;/p&gt;

&lt;p&gt;It’s a reasonable assumption that between these tools, you have solid visibility into what’s happening on your cluster’s network. But for teams focused on &lt;a href="https://www.tigera.io/learn/guides/kubernetes-security/" rel="noopener noreferrer"&gt;Kubernetes security&lt;/a&gt; and &lt;a href="https://www.tigera.io/blog/calico-whisker-staged-network-policies-secure-kubernetes-workloads-without-downtime/" rel="noopener noreferrer"&gt;policy enforcement&lt;/a&gt;, there’s a significant gap — and it’s not the one you might expect.&lt;/p&gt;

&lt;p&gt;In this post, we’ll break down exactly what EKS native observability gives you, where it falls short for security-focused use cases, and what Calico’s observability tools, Goldmane and Whisker, provide that you simply cannot get from AWS alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What EKS Gives You Out of the Box
&lt;/h2&gt;

&lt;p&gt;AWS offers two main sources of network observability for EKS clusters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC Flow Logs&lt;/strong&gt; capture IP traffic at the network interface level across your VPC. For each flow, you get source and destination IP addresses, ports, protocol, and whether traffic was accepted or rejected at the VPC level, by security groups and network ACLs. Useful for infrastructure-level visibility, but with no awareness of the Kubernetes layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container Network Observability,&lt;/strong&gt; introduced more recently and powered by Amazon CloudWatch Network Flow Monitor, goes meaningfully further. Once you’ve installed the NFM agent as a DaemonSet and configured the required IAM permissions, Scope, and Monitor resources, you get access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance metrics&lt;/strong&gt; — pod and node-level metrics including ingress/egress flow counts, packet counts, bytes transferred, and bandwidth limit events, exposed in OpenMetrics format and scrapable by Prometheus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A service map&lt;/strong&gt; — a visualization of traffic between pods and deployments in the EKS console, showing retransmissions, retransmission timeouts, and data transferred between communicating workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A flow table&lt;/strong&gt; — a breakdown of top-talking workloads across three views: within the cluster (east-west), to AWS services (S3, DynamoDB), and to external destinations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a genuinely capable performance observability tool. If your primary concern is understanding network throughput, identifying bandwidth hotspots, tracking cross-AZ traffic costs, or detecting retransmission anomalies, Container Network Observability gives you a solid foundation.&lt;/p&gt;

&lt;p&gt;But if your primary concern is &lt;a href="https://www.tigera.io/learn/guides/kubernetes-security/kubernetes-network-security/" rel="noopener noreferrer"&gt;Kubernetes network security&lt;/a&gt;, specifically understanding policy behavior, debugging denied connections, and moving toward a least-privilege posture, it leaves critical gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  What EKS Native Observability Doesn’t Tell You
&lt;/h2&gt;

&lt;p&gt;Understanding what EKS observability doesn’t show you is just as important as knowing what it does. Several gaps become significant once you’re actively managing network policies or investigating a security incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No policy verdict context.&lt;/strong&gt; This is the most important gap. Neither VPC Flow Logs nor Container Network Observability have any awareness of Kubernetes network policies. If a Calico policy is denying traffic between two pods, you will not see that denial in AWS observability tooling. You’ll see a connection failing with no indication of which policy rule fired, which tier it belonged to, or whether the traffic was intentionally blocked or the result of a misconfiguration. For teams actively managing network policies, this makes AWS observability tools nearly useless for the most common debugging scenario you’ll face.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance metrics, not security metrics.&lt;/strong&gt; The flow-level metrics in Container Network Observability (retransmissions, retransmission timeouts, and bytes transferred) are designed to answer performance questions. They are not designed to answer security questions like: which namespaces are communicating that shouldn’t be, which egress destinations are being reached, or which policy rules are being evaluated for a given flow. These are fundamentally different observability needs, and AWS’s tooling is built for the former.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top 500 flows only, over a 1-hour window.&lt;/strong&gt; The NFM agent collects the top 500 network flows by volume every 30 seconds, and the console visualizations are scoped to a 1-hour time range. For security investigations, this matters: less frequent or lower-volume connections — exactly the kind that might indicate lateral movement or exfiltration — may not appear in the top 500 and will be invisible to the service map and flow table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No namespace-level policy context.&lt;/strong&gt; While the service map does show pod and deployment-level topology, it shows you traffic volume and performance — not whether that traffic is authorized by your network policies, which policies evaluated it, or whether any of it should be blocked. Understanding the security posture of your namespace boundaries requires a different layer of data entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup complexity.&lt;/strong&gt; Enabling Container Network Observability requires installing the NFM agent add-on, configuring IAM permissions with Pod Identity or IRSA, and creating NFM Scope and Monitor resources either through the console, AWS CLI, or Terraform. For teams managing this with IaC, that means defining additional resource dependencies and managing the Terraform AWS Provider version requirements. It’s not prohibitively complex, but it’s meaningful infrastructure to own and maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Calico Adds: Goldmane and Whisker
&lt;/h2&gt;

&lt;p&gt;Calico’s observability capabilities are built on two components introduced in Calico 3.30: &lt;a href="https://www.tigera.io/blog/calico-open-source-3-30-exploring-the-goldmane-api-for-custom-kubernetes-network-observability/" rel="noopener noreferrer"&gt;Goldmane&lt;/a&gt;, a flow log API that generates enriched, Kubernetes-native flow data, and &lt;a href="https://www.tigera.io/blog/calico-whisker-your-new-ally-in-network-observability/" rel="noopener noreferrer"&gt;Whisker&lt;/a&gt;, a web-based UI for visualizing and filtering that data in real time. Together they give you a fundamentally different class of observability — one built specifically for the Kubernetes security layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Goldmane: Flow Logs That Speak Kubernetes Security
&lt;/h3&gt;

&lt;p&gt;Where AWS Container Network Observability speaks in performance metrics, Goldmane speaks in Kubernetes policy context. Every flow log entry generated by Goldmane includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source and destination namespace, pod name, and deployment — Kubernetes identity is always present, regardless of IP churn&lt;/li&gt;
&lt;li&gt;Service names — traffic is attributed to the service it passed through, not just the backend pod IP&lt;/li&gt;
&lt;li&gt;Policy verdicts — each flow includes which Calico policy rule evaluated it, whether the action was Allow or Deny, and which tier the policy belonged to&lt;/li&gt;
&lt;li&gt;Port, protocol, and domain information — including DNS-based destinations for egress traffic&lt;/li&gt;
&lt;/ul&gt;
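
&lt;p&gt;As a rough illustration of how those fields come together, a single denied flow might be represented along the lines of the sketch below. The field names are simplified for readability and are not the literal Goldmane API schema; treat it as a mental model rather than a contract.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Simplified, illustrative flow record; not the exact Goldmane schema.
source:
  namespace: frontend
  name: web-6f7c9d-*
dest:
  namespace: backend
  name: db-0
  service: db-service
protocol: TCP
dest_port: 5432
action: Deny                    # policy verdict for this flow
policies:
- tier: security
  name: security.default-deny   # the rule that fired
  action: Deny
&lt;/code&gt;&lt;/pre&gt;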

&lt;p&gt;The policy verdict data is what changes the debugging experience most fundamentally. When a network policy misconfiguration breaks Prometheus scraping, blocks a health check probe, or silently drops traffic between namespaces — scenarios that are routine for any team actively managing network policies — Goldmane tells you exactly which rule fired and why. You’re not correlating IP addresses and timestamps across multiple tools; the answer is in the flow log.&lt;/p&gt;
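
&lt;p&gt;To make that concrete, here is roughly what a single denied flow looks like once Goldmane’s context is attached. The record below is illustrative only; the field names are paraphrased from the list above rather than copied from Goldmane’s actual schema. The point is that workload identity and the policy verdict arrive in the same entry, so there is nothing to correlate.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative flow record (field names paraphrased, not Goldmane's exact schema)
source:
  namespace: monitoring
  pod: prometheus-0
destination:
  namespace: payments
  service: payments-api
  pod: payments-api-6d9f7c
port: 9090
protocol: tcp
policy:
  action: Deny                   # the verdict that broke the scrape
  tier: platform                 # hypothetical tier name
  rule: platform.default-deny    # hypothetical rule that fired
reporter: destination
&lt;/code&gt;&lt;/pre&gt;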

&lt;p&gt;Goldmane exposes its data via a gRPC API, making it straightforward to consume from your existing observability stack, whether that’s Elasticsearch, Grafana, or a custom pipeline. It covers all flows in your cluster, not just the top 500 by volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  Whisker: Real-Time Policy Visibility Without Additional Infrastructure
&lt;/h3&gt;



&lt;p&gt;Whisker is a lightweight web console that surfaces Goldmane’s flow data without requiring any additional tooling. You can filter flows by namespace, pod, policy verdict, or direction, and see in real time which traffic is being allowed and denied across your cluster.&lt;/p&gt;

&lt;p&gt;For teams moving from a default-allow posture toward namespace isolation or zero trust, Whisker is particularly valuable during the transition: you can watch policy verdicts update live as you apply and adjust rules, rather than inferring policy behavior from downstream signals like application errors and health check failures.&lt;/p&gt;
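
&lt;p&gt;As a concrete example of the kind of change you would watch land in Whisker, the manifest below is a standard Kubernetes default-deny ingress policy for a single namespace (the namespace name is hypothetical). Apply it, filter Whisker to that namespace, and you can see immediately which flows flip from Allow to Deny.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Standard Kubernetes default-deny ingress policy for one namespace.
# The namespace name is hypothetical; point it at the namespace you are isolating.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules are defined, so all ingress is denied
&lt;/code&gt;&lt;/pre&gt;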

&lt;p&gt;Whisker is included in &lt;a href="https://www.tigera.io/blog/introducing-calico-3-30-a-new-era-of-open-source-network-security-and-observability-for-kubernetes/" rel="noopener noreferrer"&gt;Calico Open Source as of 3.30.&lt;/a&gt; Access it via a local port-forward — no agent &lt;code&gt;DaemonSet&lt;/code&gt; configuration, no IAM policies, no cloud service dependencies required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Further: Calico Cloud Free Tier
&lt;/h2&gt;

&lt;p&gt;Goldmane and Whisker give you a significantly richer observability foundation for security and troubleshooting than AWS native tooling. If you want to go further, &lt;a href="https://www.tigera.io/blog/a-detailed-look-at-calico-cloud-free-tier/" rel="noopener noreferrer"&gt;Calico Cloud’s free tier&lt;/a&gt; adds a hosted experience that requires no additional infrastructure to operate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigera.io/app/uploads/2026/03/image1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon1oeyyk072drgezq4wb.png" alt="Visualizing Security Posture with Calico Cloud Service Graph" width="800" height="461"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The Calico Cloud Service Graph provides a live, visual map of communication between namespaces, services, and pods.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Connecting your EKS cluster to Calico Cloud gives you access to the Service Graph, which provides a live visual map of how your namespaces, services, and pods are communicating, overlaid with Calico policy evaluation data. Unlike the AWS console service map, which surfaces performance metrics for your top flows, the Calico Cloud Service Graph shows you the security posture of your traffic: which connections are authorized, which are being denied, and where your policy coverage has gaps. Teams that see it for the first time consistently describe it as the moment their cluster’s network finally became legible from a security perspective.&lt;/p&gt;

&lt;p&gt;The free tier also includes the policy recommendation engine, which analyzes your cluster’s actual traffic patterns and automatically generates staged network policies to implement namespace isolation. Staged policies let you audit the recommended rules and see exactly which traffic they would allow and deny before you enforce them. It’s the fastest path from a default-allow EKS cluster to one where every namespace is isolated and secured.&lt;/p&gt;
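
&lt;p&gt;Staged policies use Calico’s own policy kinds, so a recommended rule can live in the cluster and report verdicts without enforcing them. The manifest below is a minimal hand-written sketch of that idea, assuming the projectcalico.org/v3 &lt;code&gt;StagedNetworkPolicy&lt;/code&gt; kind; the namespace, selectors, and rule are invented for illustration, and the recommendation engine’s generated policies will be richer than this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of a staged (audit-only) Calico policy; namespace, selectors,
# and the single rule are illustrative, not output from the recommendation engine.
apiVersion: projectcalico.org/v3
kind: StagedNetworkPolicy
metadata:
  name: isolate-payments
  namespace: payments
spec:
  selector: all()                # every workload in the namespace
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        namespaceSelector: projectcalico.org/name == "frontend"
      destination:
        ports:
          - 8443
# Review the reported verdicts, then convert to an enforced policy when satisfied.
&lt;/code&gt;&lt;/pre&gt;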

&lt;p&gt;Calico Cloud’s free tier is genuinely free, with no sales engagement required. It supports a single cluster with 24-hour data retention — enough to experience the Service Graph and understand what your cluster’s traffic actually looks like from a security perspective.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;VPC Flow Logs&lt;/th&gt;
&lt;th&gt;EKS Container Network Observability&lt;/th&gt;
&lt;th&gt;Calico Open Source (Goldmane + Whisker)&lt;/th&gt;
&lt;th&gt;Calico Cloud Free Tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pod / namespace identity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(deployment/pod view)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service-level visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(service map)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network performance metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(RT, RTO, bytes)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Calico policy verdict (allow/deny + which rule)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;All flows (not just top N by volume)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(top 500)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security posture / policy gap visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time policy visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(Whisker)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(Service Graph)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy recommendations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(NFM agent, IAM)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(port-forward)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;small&gt;(single manifest)&lt;/small&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Sign up for the free tier
&lt;/h3&gt;

&lt;p&gt;Everything in the two right-hand columns of the comparison above is available today: Goldmane and Whisker ship with Calico Open Source 3.30+, and the Calico Cloud free tier adds the hosted Service Graph and policy recommendations on top, with nothing extra to run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calicocloud.io" rel="noopener noreferrer"&gt;Sign up at Calico Cloud and connect your EKS cluster in under 20 minutes&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS Container Network Observability is a meaningful improvement over VPC Flow Logs and a genuinely useful tool for understanding network performance in your EKS environment. If you’re tracking retransmissions, monitoring cross-AZ traffic, or trying to identify bandwidth hotspots, it’s worth enabling.&lt;/p&gt;

&lt;p&gt;But it was designed for performance observability, not security observability. It has no awareness of &lt;a href="https://www.tigera.io/learn/guides/kubernetes-security/kubernetes-network-policy/" rel="noopener noreferrer"&gt;Kubernetes network policy&lt;/a&gt; behavior, no policy verdict data, and no visibility into whether your namespace boundaries are being respected. For teams actively managing network policies or trying to move toward a least-privilege security posture, these are not minor gaps.&lt;/p&gt;

&lt;p&gt;Goldmane and Whisker, available today in Calico 3.30+, fill exactly those gaps. They’re purpose-built for the Kubernetes security layer and give every EKS operator richer policy-level observability at no cost. If you want to go further and have a live service graph that surfaces policy context, hosted dashboards, and automated policy recommendations, Calico Cloud’s free tier is the next step.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.tigera.io/blog/what-your-eks-flow-logs-arent-telling-you/" rel="noopener noreferrer"&gt;What Your EKS Flow Logs Aren’t Telling You&lt;/a&gt; appeared first on &lt;a href="https://www.tigera.io" rel="noopener noreferrer"&gt;Tigera - Creator of Calico&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>technicalblog</category>
      <category>howto</category>
    </item>
  </channel>
</rss>
