
The migration ran clean. The VM came up on AHV within the expected window. Storage latency was nominal. The health check returned green. The team marked it complete, moved to the next workload, and closed the cutover ticket.
Seventy-two hours later, a service desk ticket arrived. Intermittent authentication failures on that VM. Not consistent — sometimes fine, sometimes not. The on-call engineer checked the obvious things: network connectivity, DNS resolution, service status. All healthy. The VM was healthy. The monitoring said healthy.
The failure didn't surface fully until a scheduled GPO refresh ran four days post-cutover and Kerberos authentication broke hard.
Post-incident analysis identified the root cause as time drift introduced during the VMware Tools replacement. Nobody had put time synchronization verification on the migration checklist — because time sync had always been a VMware Tools responsibility, and VMware Tools had been replaced as part of the migration procedure. The checklist showed "VMware Tools replaced ✅." The checklist passed. The implicit dependency on VMware Tools for time authority wasn't on the checklist at all.
This is the VMware migration failure pattern most cutover playbooks don't cover: not compute portability, but identity continuity.
The Failure Chain
The sequence is specific enough to be worth walking through precisely, because each step looks like a different problem until you see them in order.
Step 1 — VM migrates successfully to AHV or KVM. Compute layer: complete. Storage: attached. Network: connected. The migration tooling reports success. This is accurate.
Step 2 — VMware Tools is removed and replaced with the target hypervisor's guest agent. This is the correct procedure and the checklist item passes. What isn't documented: VMware Tools was managing time synchronization between the guest and the ESXi host. The replacement agent has different time sync behavior, and in many migrations to AHV or KVM the guest had been taking its time from VMware Tools rather than from an independently configured NTP source.
Step 3 — Time drift appears after reboot. Not immediately visible. The guest clock drifts gradually — often only a few minutes in the first hour. Monitoring shows the VM as healthy because the monitoring checks process health and network reachability, not clock skew against domain time.
Step 4 — Kerberos skew exceeds the 5-minute tolerance. Kerberos enforces a maximum clock skew between client and domain controller; the value is configurable through policy, but 5 minutes is the default. When the guest clock drifts past that threshold, Kerberos begins rejecting authentication tickets. The failures are intermittent because drift is gradual, so the skew crosses the threshold inconsistently depending on when tickets are issued and validated.
Step 5 — AD authentication fails intermittently. Not constantly — which makes it significantly harder to diagnose. Constant failures point immediately to a configuration error. Intermittent failures look like a network problem, a service issue, or a transient event. The VM is healthy. The domain controller is healthy. The connection is healthy. The clock is broken.
Step 6 — Certificates tied to the hostname or SPN begin failing renewal. Certificate renewal operations that depend on Kerberos-authenticated connections to the CA start failing silently. This doesn't surface immediately because existing certificates are still valid — the failure appears when renewal is attempted.
Step 7 — Monitoring still shows the VM as healthy. Compute metrics are normal. Process health is normal. Network reachability is normal. Nothing in the standard monitoring stack is measuring Kerberos ticket validity or certificate renewal success rates.
Step 8 — Failure surfaces during GPO refresh, scheduled task execution, or service restart. GPO application requires authenticated domain communication. Scheduled tasks running under domain service accounts require valid Kerberos tickets. Service restarts trigger re-authentication against the domain.
Step 9 — Post-incident analysis struggles to connect the failure to the migration. The cutover was days ago. The VM has been running. "The migration ran clean" is the answer everyone gives, because the migration checklist passed.
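The arithmetic behind Steps 3 and 4 is simple enough to sketch. The following is a minimal illustration, not a monitoring implementation: the function names are hypothetical, the drift is assumed to be linear, and the 5-minute tolerance is the Kerberos default described above. It shows why a VM can pass every check at cutover and still be on a predictable path to authentication failure.

```python
from datetime import datetime, timedelta, timezone

# Default Kerberos tolerance (5 minutes). Real domains may configure a different value.
KERBEROS_MAX_SKEW = timedelta(minutes=5)


def clock_skew(guest_time: datetime, domain_time: datetime) -> timedelta:
    """Absolute difference between the guest clock and domain time."""
    return abs(guest_time - domain_time)


def within_kerberos_tolerance(guest_time: datetime, domain_time: datetime,
                              max_skew: timedelta = KERBEROS_MAX_SKEW) -> bool:
    """True while the guest clock is close enough for Kerberos to accept tickets."""
    return clock_skew(guest_time, domain_time) <= max_skew


def hours_until_breach(current_skew_s: float, drift_s_per_hour: float,
                       max_skew_s: float = 300.0):
    """Estimate hours until skew crosses the tolerance, assuming linear drift.

    Returns 0.0 if the tolerance is already exceeded and None if the clock
    is not drifting (no breach expected under this simple model).
    """
    if current_skew_s >= max_skew_s:
        return 0.0
    if drift_s_per_hour <= 0:
        return None
    return (max_skew_s - current_skew_s) / drift_s_per_hour


# Illustrative values only: a guest 4m10s ahead of domain time still authenticates.
domain = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
guest = domain + timedelta(minutes=4, seconds=10)
print(within_kerberos_tolerance(guest, domain))  # True
# With a hypothetical drift of 5 s/hour, 250 s of skew becomes a breach in 10 hours.
print(hours_until_breach(250.0, 5.0))  # 10.0
```

The point of the second function is the deferred schedule: a health check at cutover measures the first number, while the incident is determined by the second.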
What the Migration Checklist Missed
The checklist wasn't wrong. "VMware Tools replaced ✅" is correct procedure. The problem isn't that the checklist item failed — it's that the checklist didn't capture what VMware Tools was implicitly responsible for beyond its documented feature set.
Time synchronization is the most common implicit dependency, but it's not the only one. VMware Tools mediates guest-hypervisor interactions that most migration checklists treat as binary: installed or not installed. The functional dependencies it was maintaining — time authority, some certificate operations, guest identity signals to the control plane — aren't listed as VMware Tools dependencies in most runbooks because they were never explicitly configured. They were defaults that worked because VMware Tools was present.
The Identity Continuity Gap
This failure pattern has a name: the Identity Continuity Gap — the operational gap between workload portability and trust portability during virtualization migrations.
Workload portability is what migration tooling measures: the VM can boot, run, and serve traffic on the new hypervisor. Trust portability is what migration tooling doesn't measure: the VM's identity relationships — its standing with the domain controller, its certificate chain validity, its time authority, its SPN registrations — are intact and functional on the new hypervisor.
A migration can achieve complete workload portability and zero trust portability simultaneously. The VM boots. The checklist passes. The identity layer is broken in ways that only surface under specific operational conditions.
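One way to make the distinction concrete is to record the two kinds of checks separately, so that a passing workload section cannot stand in for an unverified trust section. The sketch below is illustrative only: `MigrationReport` and the check names are hypothetical and do not come from any migration tool.

```python
from dataclasses import dataclass, field
from typing import Dict


def _all_passed(checks: Dict[str, bool]) -> bool:
    # An empty section means "never verified", which must not count as a pass.
    return bool(checks) and all(checks.values())


@dataclass
class MigrationReport:
    """Hypothetical post-cutover report mapping check name to pass/fail."""
    workload_checks: Dict[str, bool] = field(default_factory=dict)
    trust_checks: Dict[str, bool] = field(default_factory=dict)

    def workload_portable(self) -> bool:
        return _all_passed(self.workload_checks)

    def trust_portable(self) -> bool:
        return _all_passed(self.trust_checks)

    def cutover_complete(self) -> bool:
        # The cutover is only complete when both layers are verified.
        return self.workload_portable() and self.trust_portable()


report = MigrationReport(
    workload_checks={"boots": True, "storage_attached": True, "network_reachable": True},
    trust_checks={"clock_within_kerberos_tolerance": False, "kerberos_auth_succeeds": False},
)
print(report.workload_portable())  # True
print(report.trust_portable())     # False
print(report.cutover_complete())   # False
```

Treating an empty trust section as a failure is a deliberate choice here: it mirrors the incident, where the identity checks did not fail so much as never run.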
What Trust Portability Actually Requires
Five verification steps that belong on every migration checklist and aren't on most of them:
Time synchronization verification before cutover confirmation. Verify that the guest clock is synchronized to domain time within Kerberos tolerance after the guest agent replacement — before marking the migration complete.
Kerberos skew tolerance testing post-reboot. Run an explicit Kerberos authentication test after the first reboot on the new hypervisor. A successful kinit or equivalent confirms time authority is intact.
SPN audit independent of VMware Tools. Service Principal Names registered for the migrated VM should be verified post-cutover.
Certificate chain validation independent of the old hypervisor. Validate that the renewal process can complete successfully against the CA from the new hypervisor — not just that the current certificate is valid.
Identity reconciliation checkpoint as a migration gate. The completion criterion should be "VM has successfully completed a Kerberos-authenticated domain operation after migration," not just "VM is running and responding to health checks."
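The first of these checks can be approximated with a few lines of standard-library code. This is a rough sketch under stated assumptions: it assumes the domain time source answers SNTP requests on UDP port 123, it ignores network round-trip delay (acceptable when the tolerance is measured in minutes), and the function names are hypothetical. It does not replace an actual Kerberos authentication test.

```python
import socket
import struct

NTP_UNIX_DELTA = 2_208_988_800  # seconds between the NTP epoch (1900) and Unix epoch (1970)
KERBEROS_MAX_SKEW_S = 300.0     # default 5-minute tolerance; adjust if the domain differs


def transmit_time_from_packet(packet: bytes) -> float:
    """Extract the server transmit timestamp (as Unix seconds) from an SNTP reply."""
    if len(packet) < 48:
        raise ValueError("SNTP reply must be at least 48 bytes")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def query_time_source(host: str, timeout: float = 2.0) -> float:
    """Send one SNTP client request to host and return its time as Unix seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version=3, mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, 123))
        reply, _ = sock.recvfrom(512)
    return transmit_time_from_packet(reply)


def skew_gate(server_time: float, local_time: float,
              max_skew_s: float = KERBEROS_MAX_SKEW_S) -> bool:
    """True if the local clock is within tolerance of the time source."""
    return abs(local_time - server_time) <= max_skew_s
```

Run on the migrated guest after the agent replacement and first reboot, a call such as `skew_gate(query_time_source("dc01.example.internal"), time.time())` (with `import time`, and with the hostname standing in for a real domain controller) gives a pass or fail that can sit on the checklist next to the guest agent item.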
Architect's Verdict
The migration succeeded at the compute layer and failed at the trust layer because the architecture treated identity as attached to the VM rather than attached to the operational control plane.
That framing is the useful one for post-mortems: this wasn't a migration failure, it was an identity architecture assumption that the migration exposed. The VM had always depended on VMware Tools to maintain its time authority and by extension its domain trust relationships. That dependency was invisible because VMware Tools was always present. The migration removed it — and the identity layer failed on a deferred schedule, in ways that looked like network problems and transient events until the pattern became clear.
The checklist item was correct. The checklist was incomplete. The gap between those two statements is where most VMware migration issues at the identity layer live: not in what was verified, but in what was never written down as a dependency in the first place.
Additional Resources
- What Breaks First After You Leave VMware — post-cutover failure taxonomy
- The "Lift-and-Shift to KVM" Fallacy — implicit dependencies and migration complexity
- The Skills Gap Is the Real VMware Exit Risk — why identity expertise is the resource migration teams are short on
- Microsoft: Kerberos Authentication Overview — authoritative Kerberos clock skew reference
Originally published at rack2cloud.com

