vlad
I Spent $350 on an SSD to Fix a Configuration Problem (ESXi VM Migration Deep Dive)

I needed to migrate VMs off an ESXi host. Standard approach: mount an NFS datastore, use vmkfstools to copy VMDKs across.

Expected speed: ~100 MB/s.

Actual speed: 3–5 MB/s.

My conclusion: "The HDD is dying."

So I did what any reasonable person would do: ordered a 4 TB SSD for $350, cloned the entire VMFS datastore onto it, and prepared to write a guide about how SSDs are essential for ESXi migration.

Spoiler: I'm an idiot.

πŸ“Š Interactive benchmark charts β€” full matrix, sync penalty, IOPS data
πŸ™ Source β€” GitHub (raw data coming later)


The Aha Moment (That Took Too Long)

After getting the SSD and seeing 98 MB/s, I benchmarked the original HDD one more time for comparison.

Result: 82 MB/s.

Same HDD. Same VM. Same everything. Except now it was only 20% slower than the brand new $350 SSD.

I hadn't fixed the hardware. I'd accidentally fixed the configuration β€” and had no idea what I'd changed.


The Real Culprit: NFS Sync Mode

After two days of bisecting changes, I found it.

# /etc/exports β€” the line that cost me $350
/mnt/raid/esxi-import 192.168.0.0/24(rw,sync,no_root_squash,no_subtree_check)
#                                        ^^^^
#                                        THIS

`sync` mode forces the NFS server to commit every write to stable storage before acknowledging it to the client. vmkfstools issues sequential writes and waits for each acknowledgement before sending the next chunk. On spinning rust, each commit takes ~10 ms. Do the math: 100 writes/sec × ~50 KB per write = 5 MB/s. Exactly what I was seeing.
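You can feel this penalty without any NFS in the path. A minimal sketch with `dd` (the target path is just an example; point `TARGET` at the disk behind your export): `oflag=dsync` flushes every block before the next write, which is roughly the discipline sync-mode NFS imposes.

```shell
# Scratch file on the disk you want to test (example path; override TARGET)
TARGET=${TARGET:-/tmp/sync-test.bin}

# sync-like: commit each 50K block to stable storage before the next one
dd if=/dev/zero of="$TARGET" bs=50K count=200 oflag=dsync 2>&1 | tail -n1

# async-like: let the page cache absorb the writes
dd if=/dev/zero of="$TARGET" bs=50K count=200 2>&1 | tail -n1

rm -f "$TARGET"
```

On an HDD-backed path, the dsync run should land in the single-digit MB/s range, mirroring the 3–5 MB/s I saw over NFS.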

The fix:

# /etc/exports
/mnt/raid/esxi-import 192.168.0.0/24(rw,async,no_root_squash,no_subtree_check)

# Apply β€” DON'T FORGET THIS STEP
exportfs -ra

# Verify it actually loaded
exportfs -v | grep async

That last exportfs -ra is important. I've seen people edit the file and wonder why nothing changed.
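If you script migrations, it's worth failing fast on this. A small guard sketch — the export lines below are hardcoded samples in `exportfs -v` output format, just to show the match; in real use you'd feed it live output:

```shell
# check_async: succeed only when an exportfs -v line carries the async flag.
# The [(,] and [,)] anchors make "async" match only as its own option,
# never as a substring of another flag.
check_async() {
  printf '%s\n' "$1" | grep -q '[(,]async[,)]'
}

# Sample lines (real use: exportfs -v | grep esxi-import, then check_async)
check_async '/mnt/raid/esxi-import 192.168.0.0/24(rw,async,wdelay)' \
  && echo "async: ok"
check_async '/mnt/raid/esxi-import 192.168.0.0/24(rw,sync,wdelay)' \
  || echo "sync detected: fix /etc/exports, then exportfs -ra"
```

Note that a naive `grep sync` would match `async` too — which is exactly how this misconfiguration hides.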


The Numbers (16-Run Matrix)

Once I understood the real variable, I ran a proper benchmark β€” 16 combinations of source Γ— destination Γ— NIC speed Γ— MTU. Same VM throughout: 32 GB provisioned, ~19 GB allocated (thin/sparse).

| Source | Dest | NIC | MTU | MB/s |
|--------|------|-----|------|-------|
| HDD | HDD | 1G | 1500 | 82.4 |
| HDD | HDD | 1G | 9000 | 85.7 |
| HDD | NVMe | 1G | 1500 | 82.9 |
| HDD | NVMe | 1G | 9000 | 85.8 |
| HDD | HDD | 25G | 1500 | 110.2 |
| HDD | HDD | 25G | 9000 | 114.5 |
| HDD | NVMe | 25G | 1500 | 122.9 |
| HDD | NVMe | 25G | 9000 | 127.4 |
| SSD | HDD | 1G | 1500 | 98.1 |
| SSD | HDD | 1G | 9000 | 102.6 |
| SSD | NVMe | 1G | 1500 | 98.6 |
| SSD | NVMe | 1G | 9000 | 103.0 |
| SSD | HDD | 25G | 1500 | 155.1 |
| SSD | HDD | 25G | 9000 | 157.4 |
| SSD | NVMe | 25G | 1500 | 285.6 |
| SSD | NVMe | 25G | 9000 | 289.7 |

All runs with NFS async. Sync numbers below.
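Each figure is just allocated bytes over wall-clock time — in practice that means `du -m` on the flat file plus `time` around the copy. A sketch of the arithmetic, with values chosen to reproduce the table's 82.4 MB/s HDD→HDD row (the elapsed time is back-calculated, not a logged measurement):

```shell
# MB/s = allocated megabytes / elapsed seconds for one vmkfstools run.
# ALLOCATED_MB would come from `du -m` on the flat file,
# SECONDS_ELAPSED from wrapping the copy in `time`.
ALLOCATED_MB=19456       # ~19 GB actually allocated (thin disk)
SECONDS_ELAPSED=236      # wall-clock time of the copy
awk -v mb="$ALLOCATED_MB" -v s="$SECONDS_ELAPSED" \
    'BEGIN { printf "%.1f MB/s\n", mb / s }'
```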

The Sync Penalty β€” Measured Properly

| Setup | Async | Sync | Multiplier |
|-------|-------|------|------------|
| HDD→HDD 1G | 82.4 MB/s | 6.5 MB/s | 12.7× |
| HDD→HDD 25G | 110.2 MB/s | 6.5 MB/s | 16.9× |
| SSD→NVMe 25G | 289.7 MB/s | 12.3 MB/s | 23.5× |

The faster the rest of the stack, the more sync costs you. On 25G with NVMe, sync cuts throughput by 23.5×. This is not a minor tuning knob.


The Bottleneck Hierarchy

Each layer proven by isolating variables:

1. NFS async mode     β€” 13-24Γ— improvement, always, no exceptions
2. Source drive       β€” 1.2Γ— on 1G, up to 2.3Γ— on 25G+NVMe  
3. Network speed      β€” 1.3Γ— with HDD, up to 2.9Γ— with SSD+NVMe
4. Destination drive  β€” ~0 on 1G, +84% on 25G+SSD
5. MTU 9000           β€” consistent +3-4 MB/s absolute, everywhere

The hierarchy is strict. Fix layer N before upgrading layer N+1.

NVMe destination on 1G = +0.4 MB/s.

NVMe destination on 25G with SSD source = +130 MB/s.

Same hardware. Completely different result.


Step-by-Step: The Correct Way

NFS Server Setup

# /etc/exports
/mnt/raid/esxi-import 192.168.0.0/24(rw,async,no_root_squash,no_subtree_check)

exportfs -ra
exportfs -v | grep async   # verify
systemctl enable --now nfs-server

ESXi NFS Mount

esxcli storage nfs add \
  -H 192.168.0.105 \
  -s /mnt/raid/esxi-import \
  -v migration_target

esxcli storage nfs list   # verify

Migrate

vmkfstools -i /vmfs/volumes/source/vm/disk.vmdk \
  -d thin \
  /vmfs/volumes/migration_target/vm/disk.vmdk
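One disk at a time gets old on multi-disk VMs. A dry-run sketch that plans one `vmkfstools` command per descriptor VMDK in a folder (paths follow the article's layout; drop the `echo` to actually execute):

```shell
# Print one vmkfstools command per descriptor .vmdk in a VM folder.
# -flat/-delta files are data extents: vmkfstools copies those itself
# when given the descriptor, so they are skipped here.
plan_copies() {
  src=$1; dst=$2
  for vmdk in "$src"/*.vmdk; do
    [ -e "$vmdk" ] || continue                 # folder had no vmdks
    case "$vmdk" in
      *-flat.vmdk|*-delta.vmdk) continue ;;
    esac
    echo vmkfstools -i "$vmdk" -d thin "$dst/$(basename "$vmdk")"
  done
}

plan_copies /vmfs/volumes/source/vm /vmfs/volumes/migration_target/vm
```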

Validate

# Hash the copied disk on ESXi, from inside the datastore so the
# recorded path is relative and checkable on the other side
cd /vmfs/volumes/migration_target/vm
md5sum disk-flat.vmdk > disk.md5

# Cross-check against the source while it's still mounted
md5sum /vmfs/volumes/source/vm/disk-flat.vmdk

# Unmount the source, then verify the copy cold on the NFS server
cd /mnt/raid/esxi-import/vm
md5sum -c disk.md5

The Economics

| Option | Cost | Speed gain |
|--------|------|------------|
| Fix NFS async | $0 | 13-24× |
| MTU 9000 | $0 | +3-4 MB/s |
| Upgrade to 25G NIC | ~$150 used | +30-150% (stack dependent) |
| SSD staging | $80-350 | +20% on 1G, +84% on 25G |

Fix the config before buying hardware. Every time.


Lessons Learned

  1. Configuration mistakes cost more than hardware β€” $350 SSD, fixed by editing one word in a config file
  2. Measure before you buy β€” 5 minutes of benchmarking would have saved the SSD purchase entirely
  3. The checklist exists for a reason β€” exportfs -v | grep async takes 2 seconds
  4. Upgrades multiply, they don't add β€” NVMe destination means nothing without the network to feed it

Conclusion

If you take away one thing:

NFS sync mode will ruin your day, your week, and your migration project.

Enable async. Reload the config. Verify it loaded. Test it works. Then worry about whether you need an SSD.


Want the Full Data?

This article covers the fundamentals. The Advanced Deep Dive ($9) goes further:

  • Multi-target NFS fan-out (8 simultaneous copies, throughput plateau analysis)
  • tmpfs as migration target β€” eliminating destination storage from the critical path entirely
  • 2026 hardware update: dual 25G + iSCSI over NVMe RAID0
  • Parallelism scaling data and where it stops helping
  • Raw NVMe baseline (concurrent dd readers, queue depth analysis)
  • Interactive benchmark charts β€” full matrix, sync penalty, IOPS (free)
  • Source β€” raw data coming later

The destination storage is almost never the real constraint. The advanced article explains what is β€” and what to do about it.

get in touch: vlad [at] cipeople [dot] com β€” this is a solved problem.


All numbers from real hardware, real workloads. NFS async throughout unless explicitly marked sync. February 2026.
