DEV Community

Jovan Chan

Originally published at runaihome.com

Mini PC for Local LLMs in 2026: Which $500–$1,500 Machines Actually Work

TL;DR: Mini PCs can run local LLMs if you match the machine to the model size. The real bottleneck is memory bandwidth, not TOPS or core count. A $650 Ryzen 8000 machine handles 7B–13B models at interactive speed; for 30B+ you need either 64GB DDR5 (slower but doable) or a Strix Halo machine with LPDDR5X-8000 (significantly faster, significantly pricier).

| | Ryzen 8000 32GB ($500–$700) | Ryzen 8000 64GB ($700–$1,000) | AI Max 395 64GB ($1,499) |
| --- | --- | --- | --- |
| Best for | 7B–13B daily driver | 28B–32B at moderate speed | 70B+ at usable speed |
| Memory bandwidth | ~80 GB/s (shared CPU/GPU) | ~80 GB/s (shared CPU/GPU) | ~256 GB/s (unified LPDDR5X) |
| Llama 3 8B speed | 18–25 tok/s | 18–25 tok/s | 60+ tok/s |
| Llama 3.1 70B speed | Won't fit (24GB VRAM max) | 1–2 tok/s (partial offload) | 4–8 tok/s |
| The catch | Can't run 30B+ at speed | Same bandwidth as the 32GB machine | 2–3× the price of Tier 1 |

Honest take: The Beelink SER8 at ~$650 is the best pure-value pick for anyone whose workload fits inside 13B parameters. If you need 70B, the GMKtec EVO-X2 at $1,499 is the only mini PC under $2,000 that makes it feel practical.


The Spec That Actually Matters

Every mini PC manufacturer wants you to focus on TOPS. Intel Lunar Lake: 48 TOPS NPU. AMD Ryzen AI 9 HX 370: 50 TOPS NPU. Qualcomm Snapdragon X Elite: 45 TOPS.

These numbers are real but nearly useless for local LLM inference in 2026. Ollama, llama.cpp, and LM Studio don't route LLM workloads through the NPU. The NPU handles narrow, fixed-function AI tasks—face detection, noise suppression, Windows Studio Effects—not the autoregressive token generation that runs your chatbot.

The spec that actually controls your inference speed is memory bandwidth: how fast the processor can stream model weights from RAM into the compute cores. Token generation is almost entirely memory-bandwidth-bound, because producing each new token means reading the model's active weights from memory again. Double the bandwidth and you roughly double the tokens per second. Add more TOPS without adding bandwidth and you get almost nothing for local LLMs.
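
To make the bandwidth argument concrete, here is a minimal sketch of the ceiling it implies. It assumes a dense model whose weights are all read once per generated token, and it ignores compute, KV-cache traffic, and prompt processing, so treat the result as an upper bound rather than a prediction. The function name is illustrative; the ~80 and ~256 GB/s figures and the roughly 40GB size of a 70B model at Q4 are the ones quoted in this article.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / weights_gb


# Assumed inputs: bandwidth figures from this article, ~40 GB for 70B at Q4.
for label, bandwidth in [("dual-channel DDR5", 80.0), ("Strix Halo LPDDR5X", 256.0)]:
    ceiling = max_tokens_per_second(bandwidth, 40.0)
    print(f"{label}: at most {ceiling:.1f} tok/s on a 40 GB model")
```

Mixture-of-experts models read only their active experts per token, so they can run faster than a ceiling computed from their total size.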

Keep that in mind through every tier below.


Tier 1: The $500–$700 Ryzen 8000 Sweet Spot

The AMD Ryzen 8000 series (specifically chips like the 8845HS and 8945HS) turned the mini PC into a credible local AI node. The reason is the Radeon 780M integrated GPU, which supports iGPU offloading through Ollama's ROCm path (CUDA is NVIDIA-only and does not apply here) and has enough compute to run 7B models at interactive speed.

What to buy here:

The Minisforum UM890 Pro (Ryzen 9 8945HS, 32GB DDR5-5600, 1TB NVMe) runs about $650 at Amazon and Micro Center. The Beelink SER8 in the same 32GB DDR5 configuration is priced nearly identically. Both machines use dual-channel DDR5-5600, which gives roughly 80–85 GB/s of usable memory bandwidth (the theoretical peak is 89.6 GB/s) shared between the CPU and the Radeon 780M iGPU.

In practice, Ollama offloads model layers to the 780M and leaves the CPU lightly loaded during inference, so the GPU gets most of that bandwidth. Real-world results from community benchmarks: 18–25 tokens per second on Llama 3 8B at Q4, and about 5–8 tok/s on 13B models. That's fast enough for interactive chat and coding assistance.
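
Community numbers vary with quantization and driver setup, so it is worth measuring your own machine. The sketch below assumes a local Ollama server on its default port (11434) and relies on the `eval_count` and `eval_duration` fields that Ollama's `/api/generate` endpoint reports in its response; the helper names and the example model tag are illustrative, not part of any official client.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Decode speed from an Ollama /api/generate response (durations are in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Send one non-streaming request to a local Ollama server, return tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Example call (requires a running server and a pulled model):
# print(benchmark("llama3:8b", "Explain memory bandwidth in one paragraph."))
```

Run it a few times and discard the first result, since the first request also pays the cost of loading the model into memory.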

What won't work at this tier:

Anything larger than about 20B parameters hits a wall. The UM890 Pro maxes out at 96GB of DDR5 with both slots filled, but the bandwidth doesn't change: adding more RAM lets larger models load without crashing, it doesn't make them faster. A Llama 3.1 70B model at Q4 requires about 40GB just for weights, which is more than the iGPU can address on a 32GB machine, so Ollama falls back to partial CPU offload and you'll see 1–2 tok/s if it runs at all.
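
The 40GB figure is easy to sanity-check yourself. This is a rough sketch, assuming about 4.5 bits per weight for a 4-bit quantization (the extra half bit covers stored scale factors) and an arbitrary 8GB reserve for the OS, KV cache, and runtime; both numbers are assumptions, and the function names are hypothetical.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         memory_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if the weights plus an assumed reserve fit in the given memory."""
    return weights_gb(params_billions, bits_per_weight) + reserve_gb <= memory_gb


# Assumed 4.5 bits/weight for Q4; memory sizes are the tiers in this article.
for memory in (32, 64, 128):
    print(f"70B at Q4 in {memory} GB: {fits(70, 4.5, memory)}")
```

Longer context windows grow the KV cache, so raise `reserve_gb` if you plan to feed the model large documents.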

Best use case: A dedicated always-on home AI server for 7B–13B models. These machines idle at 10–15 watts and draw 25–65 watts under inference load. Running 8 hours of inference daily at $0.12/kWh costs roughly $1–3 per month, idle hours included. That's dramatically cheaper than $20/month for ChatGPT Plus if you're doing volume work.
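
If your electricity rate or duty cycle differs, the running cost is simple to recompute. The sketch below is illustrative only: it assumes a 30-day month and that the machine sits at idle power whenever it is not running inference. The wattages and the $0.12/kWh rate in the example are the ones quoted above; the function name is made up for this example.

```python
def monthly_cost_usd(load_watts: float, idle_watts: float,
                     load_hours_per_day: float = 8.0,
                     price_per_kwh: float = 0.12, days: int = 30) -> float:
    """Electricity cost of an always-on machine over one month."""
    idle_hours_per_day = 24.0 - load_hours_per_day
    watt_hours_per_day = load_watts * load_hours_per_day + idle_watts * idle_hours_per_day
    kwh = watt_hours_per_day * days / 1000.0
    return kwh * price_per_kwh


# Low and high ends of the Tier 1 power range quoted in this article.
low = monthly_cost_usd(load_watts=25, idle_watts=10)
high = monthly_cost_usd(load_watts=65, idle_watts=15)
print(f"Estimated monthly cost: ${low:.2f} to ${high:.2f}")
```

The same function works for the higher-power tiers: plug in their load wattage and your local rate.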


Tier 2: 64GB DDR5 — The Bigger-Model Compromise ($700–$1,000)

The Minisforum UM890 Pro in 64GB DDR5 configuration costs roughly $729. The bandwidth story hasn't changed—still the same dual-channel DDR5-5600 bus—but now you have enough memory headroom to actually load 28B and 32B models without partial CPU offloading.

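Community benchmarks on the UM890 Pro with 64GB DDR5 show Gemma 4 28B at about 19.5 tok/s and Qwen3.5-32B at 20.8 tok/s. Those speeds are possible because the entire model fits in GPU-accessible unified memory without any CPU offloading penalty. They are also well above what ~80 GB/s could deliver if every weight were read for every token (a dense 32B model at Q4 is roughly 18GB, which caps out near 4–5 tok/s), so they most likely reflect mixture-of-experts variants that touch only a fraction of their weights per token. At 20 tok/s, reading a response feels immediate: the bottleneck shifts to your reading speed, not the model.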

This is the tier that gets overlooked. People jump from "Ryzen 8000 mini PC" to "I need an AI Max machine" without realizing a $729 machine with 64GB runs 30B models well. If your use case is coding assistance, document summarization, or casual chat with models like Qwen3.5-32B or Phi-4, you don't need the next tier.

Where it still falls short: 70B models. Even with 64GB of unified memory, the bandwidth ceiling means 70B Q4 would crawl at sub-3 tok/s—unusable for interactive work. For 70B inference at practical speeds, you need the next tier.


Tier 3: AMD Strix Halo — The $1,499 Turning Point

The AMD Ryzen AI Max+ 395 (code-named Strix Halo) changes the architecture meaningfully. Instead of regular dual-channel DDR5, it uses a 256-bit LPDDR5X-8000 bus with up to 128GB of unified memory. Theoretical bandwidth: 256 GB/s. Measured GPU bandwidth: approximately 215 GB/s in practice.

That's more than 2.5× the bandwidth of a regular Ryzen 8000 mini PC. And unlike adding more DDR5 sticks, this improvement directly translates to faster tokens per second at every model size.
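
The theoretical figures follow directly from bus width and transfer rate. As a minimal sketch (the function name is illustrative, and dual-channel DDR5 is treated as a 128-bit bus, i.e. two 64-bit channels), peak bandwidth is bytes per transfer times transfers per second:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s: (bits / 8) bytes per transfer x MT/s."""
    return bus_width_bits / 8 * mega_transfers_per_s / 1000


ddr5 = peak_bandwidth_gb_s(128, 5600)        # dual-channel DDR5-5600
strix_halo = peak_bandwidth_gb_s(256, 8000)  # 256-bit LPDDR5X-8000
print(f"DDR5-5600 x2: {ddr5:.1f} GB/s, LPDDR5X-8000: {strix_halo:.1f} GB/s")
print(f"Ratio: {strix_halo / ddr5:.2f}x")
```

These are peak numbers; sustained bandwidth is lower on both platforms, which is why the measured Strix Halo figure sits below its 256 GB/s theoretical value.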

What to buy here:

The GMKtec EVO-X2 ships with the Ryzen AI Max+ 395, currently priced at $1,499 for the 64GB/1TB variant and $1,999 for the 128GB/2TB variant (promotional pricing from GMKtec's site as of May 2026; original MSRP was higher). The Beelink GTR9 Pro uses the same processor and LPDDR5X memory architecture at a similar price point.

Real performance numbers:

  • Llama 3 8B at Q4: 60+ tok/s — effectively instant for interactive use
  • Llama 3.1 70B at Q4: 4–8 tok/s — usable for single-turn queries, a bit slow for rapid back-and-forth
  • Llama 3.3 70B at Q6_K: 3.7–3.8 tok/s (verified on the EVO-X2 by independent benchmarks)
  • Qwen3:235B: ~11 tok/s (128GB machine only)

The 64GB machine can run 70B models with room to spare. At 4–8 tok/s, 70B inference sits at the lower bound of comfortable interactive use: a long response of several hundred tokens can take a minute or more, but short coding queries or Q&A are fine. For always-on batch processing or API serving where throughput matters more than latency, 70B on this hardware is genuinely practical.

The trade-off: Under sustained AI inference load, the Ryzen AI Max+ 395 machines draw 60–120 watts. That's roughly double the 25–65 watts a Tier 1 machine draws under load, and 4–12× its 10–15 watt idle draw. Not a dealbreaker, but worth factoring into total cost of ownership over 2–3 years.

For a full cost-versus-cloud comparison at this performance tier, see our piece on QLoRA on RTX 4090 total cost vs RunPod—the math framework applies equally to mini PC ownership vs renting cloud GPUs at RunPod.


Where Does Mac Mini Fit?

The Mac Mini M4 Pro, covered in detail in our dedicated review, enters the picture around $1,399 for the base M4 Pro (24GB unified) and $2,199 for the M4 Pro 64GB configuration. The M4 Pro's 14-core CPU and 20-core GPU share 273 GB/s of unified memory bandwidth, which is slightly above the
