This article was originally published on runaihome.com
TL;DR: The AMD Ryzen AI Max+ 395 hits 100 t/s on Qwen3-30B-A3B and runs 120B models that physically don't fit on any single consumer discrete GPU, all in a $1,499–$1,999 mini PC. It's bandwidth-constrained (256 GB/s vs 1,792 GB/s on an RTX 5090), so for models that fit in a discrete GPU's VRAM, the discrete GPU is faster. The machine earns its price for one audience: people who need 70B+ models fully in GPU memory, without a dedicated GPU tower.
| | Strix Halo Mini PC | RTX 5060 Ti 16GB Build | Mac Mini M4 Pro 48GB |
|---|---|---|---|
| Best for | 70B–120B models entirely in GPU memory | ≤13B at 80–130 t/s, budget build | 30–70B, silent, efficient |
| Price | $1,499–$1,999 (complete) | ~$1,400 (complete build) | $1,799 (complete, 48GB config) |
| The catch | Bandwidth bottleneck; Linux preferred | Hard ceiling at 16GB VRAM | 48GB max without $4,999 Ultra |
Honest take: For Llama 3.3 70B or DeepSeek R1 70B fully in GPU memory without a dedicated GPU tower, the GMKtec EVO-X2 at $1,499 is hard to beat on x86 — but if you can live with 48GB, a Mac Mini M4 Pro is simpler and draws less than half the power.
What Strix Halo actually is
Strix Halo is AMD's internal codename for the die inside the Ryzen AI Max+ 395. The unusual part is the memory architecture: 128 GB of LPDDR5X-8000 on a 256-bit bus, shared between the CPU and an integrated 40-compute-unit RDNA 3.5 GPU (the Radeon 8060S). There's no PCIe bottleneck and no VRAM ceiling separate from system RAM: the GPU reads from the same 128 GB pool as the CPU, at full memory bandwidth.
In practice, the chip can allocate up to 96 GB to the GPU, leaving the remaining 32 GB for the OS and CPU-side workloads. That's a larger GPU memory pool than any consumer discrete GPU, including the RTX 5090 (32 GB).
The rest of the spec sheet: 16 Zen 5 CPU cores clocked up to 5.1 GHz, a 50+ TOPS XDNA 2 NPU, and a configurable TDP range of 45W–120W with a 55W default. AMD fabbed it on TSMC 4nm. These chips ship in mini PCs — no PCIe card to install, no separate PSU math to worry about.
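The 256 GB/s peak figure follows directly from the bus width and transfer rate quoted above. A minimal sketch of that arithmetic; the function name is illustrative, and the RTX 5090 inputs (512-bit bus, 28,000 MT/s GDDR7) are publicly listed specs assumed here rather than taken from this article:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: int) -> float:
    """Theoretical peak memory bandwidth in decimal GB/s."""
    bytes_per_transfer = bus_width_bits / 8          # bits -> bytes moved per transfer
    return bytes_per_transfer * transfer_rate_mt_s * 1e6 / 1e9

# Strix Halo: LPDDR5X-8000 on a 256-bit bus (figures from the article)
strix_halo = peak_bandwidth_gb_s(256, 8000)

# RTX 5090: assumed 512-bit bus at 28,000 MT/s (not from the article)
rtx_5090 = peak_bandwidth_gb_s(512, 28000)

print(f"Strix Halo peak: {strix_halo:.0f} GB/s")
print(f"RTX 5090 peak:   {rtx_5090:.0f} GB/s")
print(f"Ratio:           {rtx_5090 / strix_halo:.1f}x")
```

These are theoretical peaks; sustained bandwidth under real workloads is lower.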
The actual benchmark numbers
All results below come from community testing on a Beelink GTR9 Pro running Ubuntu 24.04 with Mesa RADV (kisak PPA, version 26.0.6–26.1.1), llama.cpp builds b9049–b9467, Ollama 0.23.1, and AMD_VULKAN_ICD=RADV set for the Vulkan backend. Tested May–June 2026.
| Model | Quantization | Backend | Generation (t/s) |
|---|---|---|---|
| Qwen3-30B-A3B (MoE) | IQ4_XS | RADV Vulkan | 100.04 |
| Qwen3-Coder 30B-A3B | Q4_K_S | RADV Vulkan | 98.51 |
| Qwen3-Coder 30B-A3B | UD-Q4_K_XL | RADV Vulkan | 96.76 |
| GPT-OSS 120B | MXFP4 | RADV Vulkan | 55.57 |
| Qwen3.6 | Q4_0 (speed-first) | RADV Vulkan | ~81 |
| Qwen3.6 | balanced | RADV Vulkan | ~63 |
100 t/s on a 30B model is comfortable real-time speed for single-user inference. The more striking number is GPT-OSS 120B at 55 t/s: a 120-billion-parameter model running entirely in unified memory at a speed that makes it useful for single-user chat.
Why MoE models run faster here: the 30B-A3B variants (Qwen3's Mixture-of-Experts architecture) activate only ~3B parameters per forward pass despite having 30B total weights. On a bandwidth-constrained system, fewer weights loaded per token means directly higher tokens/sec. If you're running Strix Halo hardware, prioritize MoE-architecture models — the performance advantage is significant.
The real-world bandwidth measurement confirms the constraint: the system delivers ~215 GB/s measured versus the theoretical 256 GB/s peak, a ~16% gap typical for LPDDR5X under mixed CPU+GPU load.
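To see why the active-parameter count matters so much, a rough bandwidth-bound ceiling can be estimated as measured bandwidth divided by the bytes of weights read per generated token. This is a back-of-the-envelope sketch, not a performance model: it assumes IQ4_XS averages about 4.25 bits per weight, treats the "A3B" label as exactly 3B active parameters, and ignores KV-cache reads and compute time.

```python
def decode_ceiling_tps(active_params_billions: float,
                       bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec if every active weight is read once per token."""
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

MEASURED_BW = 215.0   # GB/s, measured figure from the article
PEAK_BW = 256.0       # GB/s, theoretical peak from the article
IQ4_XS_BITS = 4.25    # assumption: approximate average bits per weight

ceiling = decode_ceiling_tps(3.0, IQ4_XS_BITS, MEASURED_BW)
measured_tps = 100.04  # Qwen3-30B-A3B IQ4_XS result from the table above

print(f"Bandwidth gap vs peak: {1 - MEASURED_BW / PEAK_BW:.0%}")
print(f"Estimated ceiling:     {ceiling:.0f} t/s")
print(f"Measured / ceiling:    {measured_tps / ceiling:.0%}")
```

Under these assumptions the measured result lands within the same order as the ceiling, which is what you would expect if generation is limited by memory bandwidth rather than compute.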
What fits in memory — and what doesn't
The GPU can access up to 96 GB of the 128 GB pool. At Q4_K_M quantization:
| Model | Approx. VRAM needed | Fits on Strix Halo? | Fits on RTX 4090 (24GB)? |
|---|---|---|---|
| Llama 3.3 70B | ~42–48 GB | Yes — ~48 GB headroom left | No (CPU offload needed) |
| Qwen3-30B (dense) | ~18 GB | Yes | Yes |
| DeepSeek R1 70B distill | ~42 GB | Yes | No (CPU offload needed) |
| GPT-OSS 120B | ~65–70 GB | Yes (tight) | No |
| DeepSeek R1 671B | ~380 GB | No — needs multi-node | No |
| Llama 4 Maverick 402B | ~230+ GB | No | No |
An RTX 5060 Ti 16GB hits its ceiling around 13B Q4_K_M. An RTX 4090 at 24 GB tops out around 30B at the same quantization before requiring CPU offloading. On Strix Halo, Llama 3.3 70B loads entirely into the GPU memory pool, with no CPU offloading and no PCIe bottleneck. The VRAM math behind these numbers is covered in detail in How Much VRAM Do You Need for Llama Models.
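The fit/no-fit calls above can be approximated with a simple rule of thumb: weight size is parameter count times bits per weight, plus some allowance for context. A minimal sketch, assuming Q4_K_M averages roughly 4.8 bits per weight and using a hypothetical flat 4 GB allowance for KV cache and runtime buffers (real overhead depends on context length and backend):

```python
Q4_K_M_BITS = 4.8    # assumption: approximate average bits per weight
OVERHEAD_GB = 4.0    # hypothetical allowance for KV cache and buffers

def weights_gb(params_billions: float, bits_per_weight: float = Q4_K_M_BITS) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bits_per_weight / 8

def fits(params_billions: float, pool_gb: float, overhead_gb: float = OVERHEAD_GB) -> bool:
    """True if weights plus the overhead allowance fit in the memory pool."""
    return weights_gb(params_billions) + overhead_gb <= pool_gb

pools = {"RTX 5060 Ti": 16, "RTX 4090": 24, "Strix Halo GPU pool": 96}

for params in (13, 30, 70):
    verdicts = ", ".join(
        f"{name}: {'fits' if fits(params, gb) else 'no'}" for name, gb in pools.items()
    )
    print(f"{params}B ~{weights_gb(params):.0f} GB weights -> {verdicts}")
```

Treat the output as a first-pass filter; long contexts can add many gigabytes beyond the flat allowance used here.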
Strix Halo vs a discrete GPU build
This is the decision that actually matters, and the answer is clear-cut on each side of the VRAM line.
Where discrete GPUs win: every model under 32 GB. An RTX 5090 generates ~186 t/s on Qwen3 8B Q4_K_M. The same model on Strix Halo runs around 80–90 t/s. Memory bandwidth is the reason: 1,792 GB/s on the RTX 5090 vs ~215 GB/s real-world on Strix Halo. For a daily 7B or 14B coding assistant — see the local AI coding stack with Continue.dev + Ollama — a mid-range discrete GPU outperforms Strix Halo and often costs less.
Where Strix Halo wins: any model above 24 GB that you need running fully in GPU memory. An RTX 4090 can't load Llama 3.3 70B without splitting layers to CPU RAM, which drops generation speed to 2–5 t/s. Strix Halo loads it in ~40 seconds and generates at ~30–35 t/s. That's roughly a 6–17× speed difference on the same model (30 ÷ 5 at the low end, 35 ÷ 2 at the high end).
The cloud comparison matters here too. Running Llama 3.3 70B on RunPod costs $0.29–$0.59/hour depending on GPU availability. At $1,499 for a GMKtec EVO-X2, you break even at roughly 2,500–5,200 hours of use ($1,499 ÷ $0.59 and $1,499 ÷ $0.29). At 6 hours/day, that is around 14–28 months of daily active use. After that, inference costs only electricity. We ran this calculation in detail in the RunPod vs local GPU: when to rent vs buy article.
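The break-even arithmetic is easy to rerun with your own numbers. A small sketch using the purchase price, hourly rates, and daily usage quoted above; it assumes a constant cloud rate and ignores electricity, resale value, and idle time, and the function names are illustrative:

```python
def break_even_hours(hardware_cost: float, cloud_rate_per_hour: float) -> float:
    """Hours of cloud rental that cost as much as buying the hardware."""
    return hardware_cost / cloud_rate_per_hour

def break_even_months(hardware_cost: float, cloud_rate_per_hour: float,
                      hours_per_day: float, days_per_month: float = 30.44) -> float:
    """Calendar months to break even at a given daily usage."""
    hours = break_even_hours(hardware_cost, cloud_rate_per_hour)
    return hours / hours_per_day / days_per_month

PRICE = 1499.0        # GMKtec EVO-X2, from the article
RATES = (0.59, 0.29)  # RunPod $/hour range, from the article
HOURS_PER_DAY = 6.0   # usage assumption, from the article

for rate in RATES:
    hours = break_even_hours(PRICE, rate)
    months = break_even_months(PRICE, rate, HOURS_PER_DAY)
    print(f"${rate:.2f}/h -> {hours:,.0f} hours, {months:.1f} months")
```

Heavier daily use shortens the payback period proportionally; lighter use stretches it.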
The Mac comparison
The closest comparison is the Mac Mini M4 Pro, which starts at $1,399 with 24 GB unified memory and maxes out at 48 GB for $1,799. Its memory bandwidth is 273 GB/s, slightly above Strix Halo's 256 GB/s theoretical peak (~215 GB/s measured).
For models that fit in 48 GB, the Mac Mini M4 Pro holds three advantages: substantially better power efficiency (20–30W under LLM load vs 60–120W for Strix Halo mini PCs), meaningfully more mature software (Metal via MLX is better-tuned than AMD's Vulkan/ROCm path on Linux), and quieter operation under sustained load.
Strix Halo's advantage is the 128 GB tier. If you need the full 96 GB GPU pool for 70B+ models, the Mac route to 128 GB requires the Mac Studio M4 Ultra at $4,999. A $1,499–$1,999 Strix Halo mini PC delivers 96 GB GPU memory at roughly one-third the price — the software experience is rougher, but the hardware value is real.
What you can actually buy right now
Prices verified June 2026:
| Machine | Memory | Storage | Price | Notes |
|---|---|---|---|---|
| GMKtec EVO-X2 | 128GB LPDDR5X | 2TB NVMe | ~$1,499 | Best value, 2.5GbE |
| Beelink GTR9 Pro | 128GB LPDDR5X | 2TB NVMe | $1,899–$1,999 | Dual 10GbE, better cooling |
| MINISFORUM MS-S1 Max | 128GB LPDDR5X | 2TB NVMe | ~$2,299 | Available on Newegg |
| GMKtec EVO-X2 (64GB) | 64GB LPDDR5X | 1TB NVMe | ~$1,099 | GPU pool ~48GB, still runs 70B |
The GMKtec EVO-X2 at $1,499 is the price-performance sweet spot. It has the same CPU and GPU as the Beelink GTR9 Pro and omits the dual 10GbE NICs — which you don't need for single-user home inference. The Beelink's dual 10GbE matters if you're running a shared home AI server. For that use case, the Open WebUI multi-user setup guide covers the server con