Jovan Chan

Posted on Jun 2 • Originally published at runaihome.com

Best NVMe SSD for Local AI in 2026: Model Load Speed Benchmarks (Gen 3 vs Gen 4)

#nvme #ssd #storage #localai

When building a local AI workstation, storage gets treated as an afterthought. You spend weeks choosing the right GPU, obsess over VRAM tiers, and debate PCIe 4.0 vs 5.0 for your GPU slot — then throw in whatever 2TB drive happened to be on sale. That's the wrong order of operations.

The SSD is where your model lives until the moment you fire up a prompt. Every cold start, every model switch, every reboot means the inference engine pulls tens of gigabytes off disk into system RAM or VRAM before the first token generates. On the wrong storage, a 40GB model makes you wait over a minute. On a good NVMe drive, it's under 15 seconds.

Here's the full breakdown: how storage actually affects local AI workloads, where the bottleneck is, and exactly which drives to buy in 2026.

Why Storage Is the Bottleneck Nobody Benchmarks

When you run ollama run llama3.3:70b or load a model in LM Studio, the inference engine reads the entire model file from disk before inference begins. For a 70B model quantized to Q4_K_M format, that's approximately 40 GB of sequential reads on every cold start.

At SATA SSD speeds, that read takes over a minute. On a spinning hard drive, closer to five minutes — before the first token ever appears.

The core math: model file size ÷ storage sequential read speed = theoretical minimum load time. In practice, inference engines like llama.cpp don't do pure sequential reads — they use memory mapping (mmap), where the OS handles I/O in chunks alongside metadata parsing and VRAM allocation. That overhead adds latency beyond raw throughput. But the ratio between storage tiers holds firmly. The CraftRigs benchmark confirms it directly: a 40GB model loads in 70+ seconds on SATA versus under 15 seconds on Gen 4 NVMe — a real-world 5× gap.
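The core math above can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark: the function name theoretical_load_seconds is made up for this example, the speeds are advertised sequential-read figures, and 1 GB is treated as 1,000 MB.

```python
def theoretical_load_seconds(model_size_gb: float, read_speed_mb_s: float) -> float:
    """Lower bound on load time: file size divided by sequential read speed."""
    if read_speed_mb_s <= 0:
        raise ValueError("read speed must be positive")
    # Assumption: decimal units, so 1 GB = 1,000 MB.
    return model_size_gb * 1000 / read_speed_mb_s


if __name__ == "__main__":
    model_gb = 40  # roughly a 70B model at Q4_K_M
    for name, speed in [("SATA SSD", 550), ("NVMe Gen 4", 7000)]:
        seconds = theoretical_load_seconds(model_gb, speed)
        print(f"{name}: {seconds:.1f} s minimum")
```

Treat the result as a floor. Memory mapping, metadata parsing, and VRAM allocation all add time on top of it.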

This matters more than most users realize. Local AI users swap models constantly: a fast 7B for quick answers, a slower 70B for multi-step reasoning, an SDXL checkpoint for image generation. Each switch is a full disk read. If every switch costs 70 seconds, you stop doing it. If it costs 10 seconds, you do it freely.

The Model Size Reality

Before comparing storage tiers, it's worth knowing what file sizes you're actually dealing with. These are approximate sizes for Q4_K_M GGUF quantization, which is the most common format for local inference:

Model	Quantization	Approximate File Size
Llama 3.2 3B	Q4_K_M	~2 GB
Llama 3.1 8B	Q4_K_M	~5 GB
Phi-4 14B	Q4_K_M	~9 GB
Qwen 3 30B	Q4_K_M	~18 GB
Llama 3.3 70B	Q4_K_M	~40 GB
Mistral Large 123B	Q4_K_M	~70 GB

The 70B models are where storage becomes a genuine workflow tax. At 40 GB, you're asking your drive to work hard at every session startup. Add ComfyUI alongside your LLM (SDXL checkpoints are 6–7 GB, Flux.1 Dev is 23 GB) and a single session setup can pull 60 GB or more off disk.

If you also keep multiple quantization variants of the same model (Q4_K_M for speed, Q8_0 for quality), that one 70B model becomes 100+ GB across formats. Storage fills faster than people expect.
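To see how much space a model library already takes, a short script can total the files on disk. This is a rough sketch under assumptions: total_model_gb is a hypothetical helper, the ~/models path is a placeholder, and the *.gguf pattern only matches tools that store models under that extension (some runtimes store weights under other file names).

```python
from pathlib import Path


def total_model_gb(model_dir: str, pattern: str = "*.gguf") -> float:
    """Sum the sizes of matching files under model_dir, in decimal gigabytes."""
    total_bytes = sum(
        p.stat().st_size for p in Path(model_dir).rglob(pattern) if p.is_file()
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical location; point this at wherever your models live.
    models = Path.home() / "models"
    if models.is_dir():
        print(f"{total_model_gb(str(models)):.1f} GB of GGUF files")
```

Running it before buying a drive gives a concrete number to compare against the 2TB and 4TB capacities.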

Storage Type Comparison: Load Time for a 40GB Model

The theoretical load times below are derived from advertised sequential read speeds; the real-world estimates are based on real-world benchmarks. Actual loads run longer than pure-throughput math predicts because of inference engine overhead, and that overhead takes a larger share of the total on faster drives. The ordering of the tiers holds throughout.

Storage Type	Sequential Read	Theoretical Load (40GB)	Estimated Real-World Load
Spinning HDD	~150 MB/s	~270 sec	5–8 minutes
SATA SSD	~550 MB/s	~75 sec	70–90 seconds
NVMe Gen 3 (PCIe 3.0)	~3,500 MB/s	~12 sec	18–25 seconds
NVMe Gen 4 (PCIe 4.0)	~7,000–7,450 MB/s	~6 sec	10–15 seconds
NVMe Gen 5 (PCIe 5.0)	~14,000–14,900 MB/s	~3 sec	8–12 seconds

The SATA-to-Gen-4 jump is transformative. The Gen-4-to-Gen-5 jump is marginal.

On HDDs: If you're still storing models on a spinning drive, this is your most urgent hardware upgrade — more impactful than most GPU bumps. Five to eight minutes per load destroys any workflow that involves model switching. A $120 Gen 4 NVMe fixes it permanently.

On SATA SSD: You feel this every session. The upgrade to Gen 4 NVMe recovers 60+ seconds per model load. If you switch models 5–10 times a day, that's 5–10 minutes of dead time you get back, compounding daily.

On Gen 3 NVMe: Acceptable, not optimal. You're at 18–25 seconds for a 70B load — workable if you're not switching models frequently. Upgrading to Gen 4 saves another 5–10 seconds, worth doing if you're replacing a drive for capacity anyway.
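The dead-time estimate for SATA users can be checked with simple arithmetic. The sketch below assumes a 75-second SATA load and a 15-second Gen 4 load, both picked from within the ranges in the table; daily_seconds_saved is a made-up name for illustration.

```python
def daily_seconds_saved(switches_per_day: int, old_load_s: float, new_load_s: float) -> float:
    """Seconds recovered per day when each model load gets faster."""
    return switches_per_day * (old_load_s - new_load_s)


if __name__ == "__main__":
    # Assumed load times: 75 s on SATA, 15 s on Gen 4 NVMe (40 GB model).
    for switches in (5, 10):
        saved = daily_seconds_saved(switches, old_load_s=75, new_load_s=15)
        print(f"{switches} switches/day: {saved / 60:.0f} min saved")
```

Swap in your own load times and switch counts; the same function covers the smaller Gen 3 to Gen 4 step.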

Why the Gen 4-to-Gen 5 Gap Is Smaller Than You'd Expect

Here's where the math gets interesting. Gen 5 drives are roughly 2× faster on paper — 14,000–14,900 MB/s versus 7,000–7,450 MB/s for Gen 4. But the practical cold-start improvement for LLM loading is only 2–4 seconds on a 40GB model.

Why the mismatch? Real-world loading speed through the Python/llama.cpp API tops out at roughly 1,300–2,000 MB/s, regardless of whether your drive can do 7,000 or 14,000 MB/s. The bottleneck shifts to:

Memory-mapping overhead in the OS
Layer-by-layer allocation as the model loads into VRAM
Metadata parsing and weight verification in the inference engine

Both Gen 4 and Gen 5 drives saturate the software's ability to consume data. The hardware is no longer the limit; the inference engine is. That's why the Samsung 9100 Pro (Gen 5, 14,800 MB/s) loads a 7B model in 2.6 seconds while a Gen 4 drive doing the same task might take 3.5–4 seconds. For a 70B model, the gap stays at roughly 2–4 seconds in total.
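One way to see how far software-side throughput falls below a drive's rated speed is to time a plain file read on your own machine. This is a rough sketch, not how llama.cpp loads models: it uses ordinary chunked reads rather than mmap, measure_read_mb_s is a hypothetical helper, and Python's own overhead is part of what gets measured.

```python
import time


def measure_read_mb_s(path: str, chunk_bytes: int = 16 * 1024 * 1024) -> float:
    """Read a file start to finish and return throughput in decimal MB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 skips Python's userspace buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e6 / max(elapsed, 1e-9)


# Example (path is a placeholder):
# print(f"{measure_read_mb_s('/path/to/model.gguf'):.0f} MB/s")
```

Only the first run after a reboot reflects a cold start. Later runs are often served from the OS page cache and will report speeds well above what the drive can deliver.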

At a $60–$90 premium over Gen 4 for a 2TB drive, that math doesn't favor Gen 5 for LLM-only workloads.

Drive Recommendations for Local AI Workstations

All prices are as of May 2026 and will fluctuate — verify before purchasing.

Best All-Round: Samsung 990 Pro 2TB (~$150)

7,450 MB/s sequential read, 6,900 MB/s write. The 990 Pro is the most mature, well-tested Gen 4 drive in the enthusiast market, with a thermal design that holds sustained throughput without throttling. Available at Amazon and Newegg. If you want the proven option and don't want to think about it, buy this.

Best Value: WD Black SN850X 2TB (~$156)

Within $10 of the Samsung at 7,300 MB/s read. Real-world load times are indistinguishable from the 990 Pro. WD has a strong track record in sustained workloads, and the SN850X is available at B&H Photo and Amazon. Buy whichever is cheaper on the day you're ordering.

Budget Gen 4: Kingston KC3000 2TB (~$120)

7,000 MB/s sequential read, $30 less than the premium options. For pure sequential model loading — which is exactly what this use case demands — it matches the top-tier drives. The controller is less consistent under heavy sustained writes, but model loading is read-dominated. Solid choice if the savings go toward more drive capacity elsewhere.

High Capacity: Sabrent Rocket 4 Plus 4TB (~$280)

If you store multiple 70B variants, ComfyUI checkpoints, and a Stable Diffusion model library, 2TB fills up fast. The Rocket 4 Plus at 7,100 MB/s gives you Gen 4 speed with real capacity headroom, at a price that beats most Gen 5 2TB options. The right choice if you're constantly juggling model files.
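Since capacity matters as much as speed here, it helps to compare the four Gen 4 picks above by price per terabyte. The figures below are the approximate May 2026 prices quoted in this article, and price_per_tb is a name invented for this sketch.

```python
def price_per_tb(capacity_tb: float, price_usd: float) -> float:
    """Dollars per terabyte of capacity."""
    return price_usd / capacity_tb


# (name, capacity in TB, approximate price in USD as quoted in this article)
GEN4_PICKS = [
    ("Samsung 990 Pro", 2, 150),
    ("WD Black SN850X", 2, 156),
    ("Kingston KC3000", 2, 120),
    ("Sabrent Rocket 4 Plus", 4, 280),
]

if __name__ == "__main__":
    for name, tb, usd in sorted(GEN4_PICKS, key=lambda d: price_per_tb(d[1], d[2])):
        print(f"{name}: ${price_per_tb(tb, usd):.0f}/TB")
```

Prices move quickly, so rerun the comparison with current numbers before ordering.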

Skip Unless You Have Other Use Cases: Crucial T705 2TB (~$220) or Samsung 9100 Pro 4TB (~$549)

Both are excellent drives. Neither meaningfully speeds up LLM cold-start times compared to Gen 4. Recommended only if your build also handles video editing, large dataset processing, or ot
