I spent four hours last Tuesday troubleshooting why a new GPU node couldn't reach the MLflow registry during a training run. The ACI fabric reported the endpoint as learned. The policy contract showed a permit. But packets died silently somewhere between leaf switches. The root cause? A stale endpoint entry in the COOP database that the APIC controller hadn't reconciled. I fixed it by clearing the endpoint from the leaf CLI, bypassing the abstraction layer entirely.
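For the curious, the fix was a one-liner on the affected leaf. The VRF and IP below are illustrative, and the exact syntax varies across ACI releases:

```text
leaf-101# clear system internal epm endpoint key vrf ml-prod:ml-vrf ip 10.20.30.40
```

Once the stale entry was gone, the fabric relearned the endpoint normally.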
That incident crystallized something I'd been seeing across three data center builds: when the controller's model of the network diverges from the actual forwarding state, you end up working around the abstraction, not through it. You SSH to the leaf switch and run show commands that reveal what's really happening in hardware. At that point, the controller is adding latency, not value.
The Real Tradeoff Nobody Talks About
ACI's pitch is clean: declare your intent through a GUI or API, and the fabric converges to that state. The APIC controller translates your application profiles, bridge domains, and contracts into the necessary VXLAN, EVPN, and policy constructs. You shouldn't need to understand MP-BGP route types or VNI allocation.
But here's what actually happens in production: you still need to understand those primitives when something breaks. The abstraction doesn't eliminate complexity; it relocates it behind an API that makes certain operations harder. Want to trace a specific MAC/IP binding through the fabric? You're running moquery against the APIC's object store and correlating it with CLI output from the leaf. Want to integrate with an existing BGP-based underlay? You're fighting the APIC's assumptions about how fabric routing should work.
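That correlation looks something like this in practice (the MAC address and device names are illustrative, and moquery filter syntax can shift slightly between releases):

```text
apic1# moquery -c fvCEp -f 'fv.CEp.mac=="00:50:56:AA:BB:CC"'
leaf-101# show endpoint mac 0050.56aa.bbcc
```

The first command queries the APIC's logical view of the endpoint; the second shows what the leaf has actually programmed. Divergence between the two is exactly the failure mode from the opening incident.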
NX-OS VXLAN EVPN, in contrast, gives you direct access to the forwarding primitives. You configure BGP EVPN address families, define VNI-to-VLAN mappings, and control route advertisement explicitly. There's no translation layer. What you configure is what runs in hardware.
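To make that concrete, here is a minimal standalone NX-OS sketch of those primitives: a VLAN-to-VNI mapping, an NVE interface with BGP host reachability, and the L2VPN EVPN address family. All VNIs, ASNs, and addresses are illustrative:

```text
nv overlay evpn
feature bgp
feature nv overlay
feature vn-segment-vlan-based

vlan 100
  vn-segment 10100

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback1
  member vni 10100
    ingress-replication protocol bgp

router bgp 65000
  neighbor 10.255.255.1
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended

evpn
  vni 10100 l2
    rd auto
    route-target import auto
    route-target export auto
```

Every line maps one-to-one to forwarding state you can inspect on the box. There is nothing between this text and the hardware.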
Where This Shows Up in AI Infrastructure
GPU clusters amplify every network design decision. When you're running distributed training across 64 A100 nodes, a single packet drop during an NCCL all-reduce can stall the entire job. You need:
- Deterministic forwarding paths with consistent latency
- Lossless Ethernet with PFC properly scoped to GPU traffic
- Fast convergence when a leaf switch or link fails
- Visibility into actual queue depths and buffer utilization
ACI can deliver all of this, but the configuration path is indirect. You define QoS classes in the APIC, which generates MQC policies on each leaf. You enable PFC through a fabric access policy, which the APIC pushes as platform-specific DCBX settings. When you need to verify that PFC PAUSE frames are actually being sent for CoS 3 traffic on a specific port, you're back on the switch CLI, running:
```text
switch# show interface ethernet 1/49 priority-flow-control
```
And if the output doesn't match what the APIC says is configured, you're troubleshooting two systems instead of one.
With NX-OS VXLAN EVPN, the QoS configuration is direct:
```text
class-map type qos match-all gpu-rdma
  match cos 3
policy-map type qos gpu-qos
  class gpu-rdma
    set qos-group 3
! strict priority belongs in a type queuing policy
policy-map type queuing gpu-queuing
  class type queuing c-out-8q-q3
    priority level 1
! PFC requires a no-drop class at the network-qos level
policy-map type network-qos gpu-nq
  class type network-qos c-8q-nq3
    pause pfc-cos 3
system qos
  service-policy type network-qos gpu-nq
interface Ethernet1/49
  service-policy type qos input gpu-qos
  service-policy type queuing output gpu-queuing
  priority-flow-control mode on
  priority-flow-control watch-dog-interval on
```
You write the exact policy you need. You see it in show run. You verify it in show policy-map interface. There's no model translation to debug.
The Kubernetes Integration Gap
Most AI infrastructure runs on Kubernetes now, and Kubernetes networking has strong opinions. CNI plugins like Calico, Cilium, and Antrea expect to control pod networking — IP allocation, routing, and increasingly, network policy. They assume the physical network provides L3 reachability, typically via BGP.
ACI's CNI plugin tries to bridge these worlds by mapping Kubernetes constructs to ACI objects. A namespace becomes an EPG. A network policy becomes a contract. But this creates tight coupling: your cluster lifecycle is now tied to the APIC's API and its upgrade schedule. I've seen teams delay Kubernetes upgrades by six months waiting for a compatible ACI CNI version.
The alternative pattern I'm seeing: run NX-OS VXLAN EVPN in the fabric, peer each leaf switch with its Kubernetes nodes over eBGP, and let the CNI plugin handle pod networking. Calico's top-of-rack peering model (node-to-node mesh disabled, each node peering with its leaf switches) fits this cleanly:
```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false
  asNumber: 65001
  serviceClusterIPs:
    - cidr: 10.96.0.0/12
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: leaf-1
spec:
  peerIP: 10.0.0.1
  asNumber: 65000
```
Each Kubernetes node peers with its leaf switches. Pod routes are advertised via BGP. The fabric treats them like any other /32. When a pod moves, Calico withdraws the old route and advertises the new one. Convergence is sub-second. No controller in the middle.
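On the leaf side, that peering is nothing more than a standard eBGP neighbor. The ASNs mirror the Calico manifests above; the node address is illustrative:

```text
router bgp 65000
  neighbor 10.0.0.10
    remote-as 65001
    description k8s-node-1
    address-family ipv4 unicast
```

From here the leaf carries the learned pod /32s into the fabric like any other external routes.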
The TCO Equation Has Changed
ACI's total cost of ownership used to be defensible because the APIC automation saved operational effort. But in 2026, the baseline assumption is Infrastructure as Code. You're managing everything through Terraform or Ansible anyway. The question isn't whether you have automation; it's which primitives your automation targets.
Targeting ACI means Terraform providers that wrap the APIC API, which abstracts the actual network config. Your state files contain EPGs and contracts. Your pipeline has an APIC dependency — it has to be reachable, authenticated, and healthy for changes to apply.
Targeting NX-OS EVPN means Terraform providers that generate CLI commands or use NETCONF/gNMI. Your state files contain the actual config. Your pipeline pushes directly to devices. You can stage and test config in a text file before applying it. There's no controller to version-match or license separately.
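As one sketch of that direct-to-device workflow, here is an Ansible task using the cisco.nxos collection (the VLAN and VNI values are illustrative):

```yaml
- name: Map VLAN 100 to VNI 10100 on the leaf
  cisco.nxos.nxos_config:
    parents: vlan 100
    lines:
      - vn-segment 10100
    save_when: modified
```

The task is idempotent: it diffs the intended lines against the running config and only pushes what changed, which is the same convergence property the APIC promises, applied to primitives you can read in show run.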
License cost is the obvious part: ACI requires APIC controllers (virtual or physical) with their own licensing. NX-OS VXLAN EVPN runs on the same switches with base NX-OS licensing. But the less obvious cost is operational: every abstraction layer is another integration point to maintain, another API version to track, another component in your blast radius when you upgrade.
What This Means for Your Next Build
If you're designing a leaf-spine fabric in 2026, especially for GPU-dense AI infrastructure, start with these questions:
- Do you need the APIC's policy model, or are you comfortable managing EVPN/VXLAN primitives directly through IaC?
- How tightly coupled do you want your physical network to be with your container orchestration layer?
- When troubleshooting, do you prefer working through a controller API or directly on device CLI?
For most teams I'm working with, the answers point toward NX-OS VXLAN EVPN. They're already managing network config as code. Their Kubernetes CNI handles pod networking. They want the shortest path from intent to forwarding plane, especially when debugging at 2 AM.
ACI isn't dead, but its value proposition has narrowed. It still makes sense if you have a large operational team that prefers GUI-driven workflows, or if you're deeply integrated with Cisco's broader intent-based networking stack. But for infrastructure engineers building modern GPU clusters and Kubernetes platforms, the simpler path is increasingly the better one.
The network is becoming infrastructure code. The abstraction layers that hide the primitives are becoming friction. And the teams who understand EVPN/VXLAN directly are shipping faster than the ones waiting for controller APIs to catch up.
This post is an excerpt from Practical AI Infrastructure Engineering — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at https://activ8ted.gumroad.com/l/ssmfkx
Originally published at fivenineslab.com