
The VM conversion completed without errors. Every workload made it across. The migration dashboard showed green, the project lead closed the ticket, and the consultants left the building.
Three weeks later, backup verification jobs are silently failing. Monitoring dashboards are dark. The on-call team is operating without baselines. Nobody knows what normal looks like on the new platform.
The VM conversion worked. The migration did not.
This is the lift-and-shift KVM fallacy — and it isn't a KVM problem. It's a scoping problem. Most VMware-to-KVM migration plans capture the visible dependency — the hypervisor — and treat everything built around it as someone else's project. The Operating Model Gap is what that assumption leaves behind.
What Lift-and-Shift Actually Moves
Lift-and-shift KVM moves compute. Disk images transfer. Network definitions port. VM configurations are recreated on the other side. From a data-plane perspective, the migration looks complete because the workloads are running.
What does not move:
- Operational runbooks referencing vCenter constructs
- Backup architecture built against VADP APIs
- Monitoring thresholds calibrated to vSphere metrics
- Provisioning workflows targeting vCenter endpoints
- Snapshot behavior assumptions encoded in recovery procedures
- Storage policy logic tied to vSAN semantics
- Identity and access models mapped to vCenter RBAC
- Operator muscle memory built over years of vCenter navigation

None of this appears in the migration plan. All of it breaks after cutover.
The Operating Model Gap is the distance between what the migration plan captured and what the platform actually required to function. Every item in that list is a component of the operating model. The hypervisor conversion touches none of them.
VMware Was Never Just the Hypervisor
The framing that produces lift-and-shift KVM plans is this: VMware equals ESXi. Replace ESXi with KVM. Migration complete.
That framing is wrong. VMware was never ESXi. VMware was the control plane your entire operating model was built around.
| What the plan says | What actually changes |
|---|---|
| ESXi → KVM | vCenter (lifecycle and provisioning control) |
| | vMotion semantics (live migration behavior) |
| | vSAN (storage abstraction and policy model) |
| | NSX (network policy and microsegmentation) |
| | vROps / vRealize (observability and alerting logic) |
| | VADP (backup API framework) |
| | DRS (scheduling and placement policy) |
| | Snapshot behavior (application-consistent logic) |
A VMware environment is not a hypervisor with add-ons. It is an integrated control surface where compute scheduling, storage policy, network segmentation, observability, and recovery operations all converge. When you replace ESXi with KVM, every one of those layers needs a replacement or a rebuild — and unlike ESXi, KVM does not ship them included.
KVM is a kernel module. The management plane, storage architecture, network abstraction, and observability stack are your responsibility to assemble, integrate, and operate. That assembly is the migration work most lift-and-shift plans never scope.
The Operating Model Test: If vCenter disappeared tomorrow, what percentage of your operating model would disappear with it?
For most VMware shops, the honest answer is somewhere between 60 and 90 percent. That percentage is the scope of what a lift-and-shift to KVM does not address.
The Three Failure Surfaces After Cutover
Lift-and-shift KVM migrations do not fail at cutover. They fail in operations. The failure surfaces are predictable, they appear in sequence, and they are almost never in the migration plan.
Failure Surface 1: Control Plane Replacement (Day 1–7)
You did not replace ESXi. You replaced vCenter.
vCenter was the operational control surface for provisioning new workloads, managing VM lifecycle, enforcing placement policy, controlling access, and targeting automation. When you move to KVM, vCenter is gone — and everything that pointed at it needs a new target.
The KVM ecosystem offers options: libvirt for direct management, Proxmox VE for a GUI-centric model, oVirt for a closer-to-vCenter experience, OpenStack for cloud-scale orchestration. Each is a different operating model. None is a drop-in replacement. The team that executed a lift-and-shift KVM migration and operated vCenter for a decade does not automatically know how to operate any of them under pressure at 2am.
This is the first stall point. Not because the management plane doesn't exist — it does — but because the operating model loses its control surface and the team has to rebuild operational confidence from scratch.
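To make the control-surface gap concrete, here is a minimal sketch of what "direct management with libvirt" looks like on day one: a host-level inventory of running domains, the primitive that replaces a vCenter inventory view. It assumes the libvirt Python bindings (`libvirt-python`) and a reachable local `qemu:///system` socket; nothing here is specific to any one management plane.

```python
# Minimal libvirt inventory sketch — the kind of primitive that replaces
# a vCenter inventory view after cutover. Assumes libvirt-python is
# installed and the local qemu:///system socket is reachable.
import libvirt

STATE_NAMES = {
    libvirt.VIR_DOMAIN_RUNNING: "running",
    libvirt.VIR_DOMAIN_PAUSED: "paused",
    libvirt.VIR_DOMAIN_SHUTOFF: "shut off",
}

conn = libvirt.open("qemu:///system")  # raises libvirt.libvirtError on failure
try:
    for dom in conn.listAllDomains():
        state, _reason = dom.state()
        print(f"{dom.name():30} {STATE_NAMES.get(state, 'other')}")
finally:
    conn.close()
```

The script is trivial, and that is the point: everything vCenter carried as one integrated surface (inventory, placement, access, automation targets) decomposes into primitives like this, and someone on the team has to own the assembly.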
Failure Surface 2: Storage Semantics Collapse (Day 7–30)
You did not lose shared storage. You lost the storage abstraction your platform behavior depended on.
vSAN provided a distributed storage fabric with defined behavior around replication, failure domains, snapshot consistency, and policy-based placement. That abstraction encoded a set of assumptions your entire backup architecture, recovery procedures, and performance baselines were built against.
In a KVM environment, that abstraction is gone. You are now operating raw storage — whether Ceph, NFS, iSCSI, or local — and the behavior is different in ways that matter:
- Snapshot behavior — application-consistent snapshot mechanics differ by storage backend; VADP is gone (a minimal sketch of the replacement mechanics follows this list)
- Backup assumptions — protection jobs built against VADP APIs break immediately; rebuild is required
- Performance characteristics — latency, IOPS, and throughput profiles differ between vSAN and Ceph under the same load pattern
- Replication semantics — storage replication behavior and consistency guarantees are not equivalent
- Failure domain logic — how the platform handles node loss differs from vSAN's policy model

This is where migrations pass validation and fail under load. Workloads run. The environment looks healthy. The gaps appear during the first backup verification window, the first storage-intensive workload spike, or the first incident that requires a restore from a snapshot taken after cutover.
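On the snapshot point specifically: with VADP gone, application consistency under KVM typically means driving filesystem quiesce through the QEMU guest agent yourself. A minimal sketch, assuming libvirt-python, qemu-guest-agent running inside the guest, and a hypothetical domain named `web01`; snapshot mechanics still vary by storage backend, so treat this as the shape of the work, not a drop-in procedure.

```python
# Application-consistent disk snapshot via libvirt — a minimal sketch.
# Assumes qemu-guest-agent is running inside the guest; with the QUIESCE
# flag set, snapshot creation fails outright if the agent is unreachable,
# rather than silently handing back a crash-consistent copy.
import libvirt

SNAPSHOT_XML = """
<domainsnapshot>
  <name>pre-change-snap</name>
  <description>App-consistent snapshot, quiesced via guest agent</description>
</domainsnapshot>
"""

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("web01")  # hypothetical domain name

try:
    snap = dom.snapshotCreateXML(
        SNAPSHOT_XML,
        libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY   # external disk snapshot
        | libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE,  # freeze guest filesystems first
    )
    print(f"created snapshot: {snap.getName()}")
except libvirt.libvirtError as err:
    # Typical failure mode: guest agent not installed or not responding.
    print(f"snapshot failed: {err}")
finally:
    conn.close()
```

That hard-fail behavior is exactly the kind of semantic difference that stays invisible until the first backup verification window after cutover.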
Failure Surface 3: Operational Signal Loss (Day 30+)
The workloads moved. The signals didn't.
VMware environments accumulate operational signal over years — dashboards calibrated to vROps metrics, alert thresholds tuned against vSphere counters, runbooks that reference specific vCenter constructs, capacity models built on historical data from the VMware telemetry stack. That signal is institutional knowledge encoded in tooling.
After a KVM migration, all of it is wrong. The old dashboards are meaningless because the metrics don't exist. The alert thresholds don't map because the counters are different. The runbooks reference objects that no longer exist. The on-call team is operating blind against a platform they don't have baselines for yet.
This is where Day 30 failure begins. Not a dramatic incident — a slow erosion of operational confidence, a growing number of "we're not sure what normal looks like" moments, and a steady accumulation of unresolved alerts the team has stopped trusting.
The observability rebuild is not a migration task. It is a post-migration operational project that takes weeks. It is almost never in the original migration scope.
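What the rebuild's first step looks like in practice, as a hedged sketch: start pulling raw counters from the new platform so baselines can begin accumulating. This assumes libvirt-python; the stats keys shown are ones libvirt's QEMU driver reports, and a real build-out would ship these into a time-series store rather than printing them.

```python
# First step of an observability rebuild: pull raw per-domain counters
# from libvirt so baselines can start accumulating on the new platform.
# Assumes libvirt-python and a reachable qemu:///system socket.
import libvirt

conn = libvirt.open("qemu:///system")
try:
    stats = conn.getAllDomainStats(
        libvirt.VIR_DOMAIN_STATS_CPU_TOTAL | libvirt.VIR_DOMAIN_STATS_BALLOON
    )
    for dom, record in stats:
        # Key names come from the QEMU driver; a missing key means the
        # hypervisor did not report that counter for this domain.
        cpu_ns = record.get("cpu.time")          # cumulative CPU time, ns
        mem_kib = record.get("balloon.current")  # current balloon size, KiB
        print(f"{dom.name():30} cpu.time={cpu_ns} balloon.current={mem_kib}")
finally:
    conn.close()
```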
When KVM Actually Fits
This is not a post about KVM being unsuitable for enterprise infrastructure. KVM is a legitimate hypervisor running production workloads at scale across some of the largest environments in the world. The question is not whether a lift-and-shift KVM approach works — it's whether your operating model is positioned for it.
KVM fits when the operating model already lives below VMware's abstraction layer. KVM is a Linux kernel module; operating it well means operating Linux well, at depth, under production pressure.
The signal that KVM is the right call:
- Linux is already the operational center — the team thinks in hosts, not abstractions
- Automation already targets infrastructure primitives directly, not vCenter APIs
- The team has operated without VMware's abstraction layer under pressure — not in theory, in production
- Sovereignty or cost physics make open-source the architectural requirement, not just the preference
- Greenfield or container-adjacent workloads where VMware's abstraction was overhead, not operating leverage

The distinction that matters is not "does the team know Linux." It is whether the team has operated infrastructure at the primitive layer under production pressure. A team with deep vCenter muscle memory that also has Linux skills is not the same as a team that has always operated below the abstraction. The former needs a longer runway and an explicit skills transition plan. The latter is ready.
Scope the Operating Model Before the Hypervisor
The correct sequencing for a lift-and-shift KVM migration is not: pick hypervisor, convert VMs, go live. It is: audit the operating model, scope the rebuild, then pick the hypervisor.
Four things to scope before the hypervisor decision is final:
01 — Management Plane Decision
Pick the management plane before the hypervisor. libvirt, Proxmox, oVirt, and OpenStack are not equivalent choices — each implies a different operational model, skill requirement, and automation target. The management plane decision determines the operating model. The hypervisor follows from it.
02 — Storage Semantics Audit
Map every storage dependency in the current environment — snapshot behavior, backup integration points, replication architecture, performance baselines. Document what the new storage backend provides and where the semantics differ. The delta is the rebuild scope. Treat it as a parallel workstream, not a migration task.
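A minimal sketch of where that mapping can start, assuming libvirt-python: enumerate the storage pools the new platform actually exposes, with their backend types. The semantic comparison against vSAN behavior (snapshots, replication, failure domains) is still manual analysis; this just gives it a factual inventory to work from.

```python
# Inventory the storage pools the new platform exposes, with backend
# types — the starting inventory for a semantics audit, not the audit
# itself. Assumes libvirt-python and a reachable qemu:///system socket.
import xml.etree.ElementTree as ET

import libvirt

conn = libvirt.open("qemu:///system")
try:
    for pool in conn.listAllStoragePools():
        backend = ET.fromstring(pool.XMLDesc()).get("type")  # e.g. dir, nfs, rbd
        _state, capacity, allocation, _available = pool.info()
        print(
            f"{pool.name():20} type={backend:8} "
            f"capacity={capacity / 2**30:9.1f} GiB "
            f"allocated={allocation / 2**30:9.1f} GiB"
        )
finally:
    conn.close()
```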
03 — Observability Rebuild Plan
Plan for zero operational signal on Day 1. The old dashboards are dead. The alert thresholds don't transfer. Build the observability stack against the new platform before workloads arrive — or accept that the first weeks post-cutover will be operationally blind.
04 — Skills Audit (Honest Version)
Not certifications. Not training course completions. Operational depth under pressure. Has the team operated storage at the Ceph or NFS primitive level during an incident? Have they managed KVM scheduling behavior under resource contention? Knowing how something works is not the same as having operated it when it breaks.
Architect's Verdict
KVM is not the problem. Treating the hypervisor as the platform is.
VMware was a control plane your entire operating model was built around. A lift-and-shift KVM project moves the compute layer and leaves the operating model — management plane, storage semantics, observability stack, backup architecture, and operational muscle memory — orphaned on the other side of the migration window.
The fallacy is not that KVM is harder than expected. The fallacy is scoping a lift-and-shift KVM project as a hypervisor migration when what you actually triggered is an operating model rewrite. Name it correctly before the project starts. Scope the rebuild explicitly. Run the Operating Model Test before you sign the migration plan.
If vCenter disappeared tomorrow and 70 percent of your operating model went with it, that 70 percent is the migration. The hypervisor swap is the easy part.
Originally published at rack2cloud.com
