keeper

Posted on May 25

Windows vs Linux for Local AI: My Radeon 890M Has 96GB of RAM, but Windows Only Lets Me Use 3.5GB

#ai #hardware #linux #windows

The Setup

I run a Ryzen AI 9 HX 370 mini PC as my daily AI workstation. 96GB of system RAM, a Radeon 890M integrated GPU (gfx1150, RDNA 3.5). On paper, this should be a capable local inference box — 96GB is enough to run Gemma 4, Llama 3 70B, even Mixtral 8x22B at reasonable quantization.

There's just one problem: Windows refuses to let the iGPU use more than ~3.5GB of that memory.

I spent the better part of a week trying every workaround I could find. Here's what I learned — and why Linux won this round.

The Windows VRAM Wall

The Radeon 890M is an integrated GPU. It has no dedicated VRAM. On Windows, the GPU driver allocates a fixed shared GPU memory budget from system RAM. For the 890M, that budget caps at roughly 3.5GB.

I tried everything:

Registry hacks (GpuMemoryAllocation, HwSchMode) → no effect
DirectML via PyTorch → OOM on anything above a 7B model
llama.cpp with Vulkan backend → same 3.5GB limit, enforced by the driver
Disabling Memory Integrity / VBS → freed up ~500MB, still hit the wall
BIOS UMA Frame Buffer tweaking → the 890M's firmware-based allocation is hard-coded

The root cause? Windows' WDDM driver model combined with virtualization-based security (VBS). When Memory Integrity is enabled (and on modern Windows 11 installs, it is by default), every GPU memory allocation goes through an extra hypervisor verification layer. The driver responds by clamping shared GPU memory to a conservative default. And there's no "unlock" switch, not even in Group Policy or AMD's own Adrenalin control panel.

For a 7B model at Q4_K_M (~4.5GB VRAM needed), that 3.5GB wall means the model doesn't fit. Period. The alternative is CPU-only inference — using all 96GB of RAM, but at maybe 2-3 tokens per second for anything larger than 7B.

Windows gives you a choice: tiny models on GPU, or glacial models on CPU. There's no middle ground.
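The fit-or-not arithmetic behind that 3.5GB wall can be sketched roughly as follows. This is a back-of-the-envelope estimate, not a measurement: the ~4.85 bits per weight for Q4_K_M and the flat 0.25GB allowance for KV cache and scratch buffers are my assumptions, and the function names are made up for illustration.

```python
# Rough check: does a quantized model fit inside a GPU memory budget?
# ASSUMPTIONS (not measured): ~4.85 bits per weight for Q4_K_M, and a flat
# 0.25 GB allowance for KV cache and scratch buffers at a short context.

Q4_K_M_BITS = 4.85      # assumed average bits per weight
WINDOWS_CAP_GB = 3.5    # the shared-memory cap described above


def model_footprint_gb(params_billion: float, bits_per_weight: float,
                       overhead_gb: float = 0.25) -> float:
    """Approximate memory for the weights plus a fixed overhead, in GB."""
    # (params_billion * 1e9 weights) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, bits_per_weight: float, budget_gb: float,
         overhead_gb: float = 0.25) -> bool:
    """True if the estimated footprint is within the budget."""
    return model_footprint_gb(params_billion, bits_per_weight, overhead_gb) <= budget_gb


if __name__ == "__main__":
    for size in (3, 7, 14):
        need = model_footprint_gb(size, Q4_K_M_BITS)
        verdict = "fits" if fits(size, Q4_K_M_BITS, WINDOWS_CAP_GB) else "does not fit"
        print(f"{size}B at Q4_K_M: ~{need:.2f} GB, {verdict} in {WINDOWS_CAP_GB} GB")
```

Under these assumptions a 7B model lands close to the ~4.5GB figure above, which is why it spills past the cap while a 3B model does not. Longer contexts grow the KV cache, so treat the overhead as a floor.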

The Linux Promise (and Its Own Problem)

Linux doesn't have this VRAM cap. The open-source AMD stack (the amdgpu kernel driver plus RADV, Mesa's Vulkan driver) lets the GPU use as much system memory as needed. You can allocate 64GB for a model and the driver won't blink.
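One way to see how much memory the driver actually exposes on a given machine is to read the counters amdgpu publishes in sysfs. The sketch below assumes the iGPU is `card0` (it may be `card1` on some systems) and only reads the two totals; GTT is the pool of system RAM the GPU can map, so it is the number that matters for an iGPU.

```python
from pathlib import Path

# ASSUMPTION: the Radeon iGPU is card0; adjust the path if it is not.
DEFAULT_DEVICE_DIR = "/sys/class/drm/card0/device"


def read_amdgpu_memory(device_dir: str = DEFAULT_DEVICE_DIR) -> dict:
    """Return the amdgpu VRAM and GTT totals in GiB.

    Keys that are missing (for example on a non-amdgpu device) are skipped,
    so an empty dict means the counters were not found.
    """
    base = Path(device_dir)
    totals = {}
    for key in ("mem_info_vram_total", "mem_info_gtt_total"):
        counter = base / key
        if counter.is_file():
            # Each file holds a single integer: a size in bytes.
            totals[key] = int(counter.read_text().strip()) / 2**30
    return totals


if __name__ == "__main__":
    found = read_amdgpu_memory()
    if not found:
        print("No amdgpu memory counters found at", DEFAULT_DEVICE_DIR)
    for name, gib in found.items():
        print(f"{name}: {gib:.1f} GiB")
```

If the GTT total comes back smaller than the model you want to load, that is the limit to raise before blaming llama.cpp.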

So I set up a dual-boot, installed ROCm, and ran a few tests. The performance difference was dramatic — llama.cpp with Vulkan on Linux could load a 14B Q4_K_M model entirely on the GPU, no VRAM wall.

But there was a catch. ROCm does not support the Radeon 890M.

AMD's official stance, confirmed by the Framework community (Framework Laptop 13 Ryzen AI 9 HX 370):

"AMD ROCm does not support the Radeon 890M (gfx1150). PyTorch cannot run with ROCm on this GPU."

The Vulkan backend in llama.cpp does work on Linux. It's faster than CPU and has no VRAM cap. But it's nowhere near as fast as native ROCm acceleration would be. You're getting maybe 40-60% of the performance you'd get with a supported GPU stack.
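For reference, here is a minimal sketch of what "llama.cpp with Vulkan" means in practice: build with the Vulkan backend enabled, then ask for every layer to be offloaded. The build flag and CLI options are the ones documented by llama.cpp; the model filename, prompt, and binary path are placeholders, and the helper function is just an illustration.

```python
import shlex

# Build steps for the Vulkan backend, run from the llama.cpp checkout.
BUILD_STEPS = [
    ["cmake", "-B", "build", "-DGGML_VULKAN=ON"],
    ["cmake", "--build", "build", "--config", "Release"],
]


def llama_cli_command(model_path: str, prompt: str, gpu_layers: int = 99,
                      n_predict: int = 128,
                      binary: str = "./build/bin/llama-cli") -> list:
    """Build an argv list that offloads up to `gpu_layers` layers to the GPU."""
    return [
        binary,
        "-m", model_path,          # path to the GGUF model file
        "-ngl", str(gpu_layers),   # layers to offload; a large value means "all"
        "-n", str(n_predict),      # number of tokens to generate
        "-p", prompt,
    ]


if __name__ == "__main__":
    for step in BUILD_STEPS:
        print(shlex.join(step))
    # Hypothetical model file, for illustration only.
    print(shlex.join(llama_cli_command("models/example-14b-q4_k_m.gguf", "Hello")))
```

The script only prints the commands so you can inspect them first; pass the lists to `subprocess.run` if you want it to execute them.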

The Decision Matrix

Option	Works?	Performance	VRAM Limit	Verdict
Windows + DirectML	✅	Low	3.5GB hard cap	Fine for 3B-7B, useless for anything bigger
Windows + llama.cpp Vulkan	✅	Medium-low	3.5GB hard cap	Same cap, slightly better perf
Linux + ROCm	❌	N/A	N/A	ROCm doesn't support gfx1150
Linux + llama.cpp Vulkan	✅	Medium	No cap	Best option for iGPU inference today
Linux + CPU-only (96GB)	✅	Slow (2-3 tok/s)	No cap	Works for everything, patience required

Why This Matters

The HX 370 is AMD's flagship mobile AI chip. It's built on a 4nm process, has an XDNA 2 NPU rated for 50+ TOPS, and pairs with a capable RDNA 3.5 iGPU. AMD clearly wants this chip in the "AI PC" category.

But the reality:

AMD's GPU software stack is fragmented. ROCm works on their discrete GPUs (RX 7900 series, some W-series) and Instinct cards. It does not work on RDNA 3.5 integrated GPUs. Period. The Windows HIP SDK also doesn't support it. If you bought an HX 370 machine thinking ROCm would be available, you bought a paperweight for AI workloads.
Windows' driver model is actively hostile to shared-memory GPUs. The 3.5GB hard cap isn't a bug. It's a deliberate safety boundary in WDDM. Apple Silicon Macs can share 64GB+ between CPU and GPU seamlessly. The HX 370 with 96GB of RAM should be able to do the same. It cannot. Not on Windows.
Linux works, but not well enough. The Vulkan backend lifts the VRAM cap, which is the most important win. But without ROCm, you're leaving performance on the table — roughly 40-60% of what a properly supported GPU would deliver.

So What Does Work?

For this exact machine (HX 370 + Radeon 890M), the realistic stack is:

For production workloads / multi-model serving: Linux + CPU-only. 96GB RAM handles Gemma 4 27B, Llama 3 70B, Qwen 2.5 72B at Q3/Q4. Slow but reliable.

For single-user inference / experimentation: Linux + llama.cpp Vulkan. Best balance of VRAM headroom and acceleration. No hard cap, no heavy tuning.

For portability / Windows-only environments: Windows + llama.cpp Vulkan. Stuck at the 3.5GB cap. Fine for 3B models, frustrating for anything larger.
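If you want those three recommendations in a form a setup script can branch on, a small lookup is enough. The strings come straight from the list above; the function name, parameter names, and dictionary keys are hypothetical.

```python
# The three recommendations above, keyed by use case.
RECOMMENDED_STACK = {
    "production": "Linux + CPU-only",
    "single_user": "Linux + llama.cpp Vulkan",
    "windows_only": "Windows + llama.cpp Vulkan",
}


def recommend_stack(windows_only: bool, multi_model_serving: bool) -> str:
    """Pick a stack for the HX 370 + Radeon 890M based on two constraints."""
    if windows_only:
        # Being stuck on Windows overrides everything else (3.5GB cap applies).
        return RECOMMENDED_STACK["windows_only"]
    if multi_model_serving:
        return RECOMMENDED_STACK["production"]
    return RECOMMENDED_STACK["single_user"]


if __name__ == "__main__":
    print(recommend_stack(windows_only=False, multi_model_serving=False))
```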

The Structural Problem

This isn't about Windows vs Linux fanboyism. The structural issue is:

AMD's software commitment stops at their discrete GPU line. Integrated GPUs are second-class citizens for AI — despite being in every Ryzen AI chip shipping today.
Windows' driver architecture assumes dGPUs with dedicated VRAM. Shared-memory GPUs are an afterthought, and the security hypervisor layer makes it worse.
Linux avoids the cap but can't fill the acceleration gap. Without ROCm, you're running on Vulkan, which is a general-purpose graphics and compute API, not a dedicated ML compute stack.

The HX 370 is a great CPU with a mediocre AI story. If you're building a local inference box today, you're better off pairing it with a discrete GPU — or just skipping Windows entirely for AI workloads.

Using an HX 370 for LLMs? A ⭐ or a one-word issue tells me what to build next — helps more than you'd think.

Top comments (2)

UnitBuilds • May 26

Honestly, the ROCm support is a massive fail on AMD's end. You can't promote an alternative to CUDA when only 'a select few' GPUs are properly supported, and saying 'you can use Vulkan' is a poor excuse for budget wasted on marketing instead of R&D.

keeper • May 27

Fair point, and you're right — AMD's "Vulkan is fine" stance is a cop-out. ROCm on the 890M is officially a dead end (AMD confirmed it to the Framework team), and the community hacks like HSA_OVERRIDE_GFX_VERSION are too brittle to rely on.

That said, the article's main point still stands — the VRAM cap is the real enemy. Even without ROCm, Linux + Vulkan can load a 14B model entirely on GPU, while Windows can't get past 7B. Vulkan isn't fast, but it's 2-3x CPU.

The real fix ended up being the OCuLink route — added an RTX 5060 Ti via external dock, and suddenly the whole ROCm/VRAM debate became irrelevant. CUDA just works. That's the next post.