Jovan Chan

Posted on Jun 2 • Originally published at runaihome.com

AMD Lemonade Local LLM Server: GPU + NPU Inference on Consumer Hardware (2026 Guide)

#amd #localai #npu #gpu

This article was originally published on runaihome.com

TL;DR: Lemonade is AMD's open-source local LLM server that uses your Ryzen AI NPU to cut time-to-first-token in half while offloading sustained generation to your GPU — all through a single OpenAI-compatible endpoint at localhost:13305. It handles text, image gen, speech-to-text, and TTS in one install. The catch: NPU acceleration requires Ryzen AI 300/400-series hardware (XDNA 2); older Ryzen AI chips and Nvidia hardware get no NPU benefit.

	Lemonade	Ollama	LM Studio
Best for	AMD Ryzen AI 300+ with NPU	Any hardware, broadest model support	GUI-first beginners
NPU acceleration	✅ XDNA 2 (Ryzen AI 300/400)	❌	❌
Multi-modal	LLM + Image + STT + TTS built-in	LLM only	LLM + basic vision
AMD GPU setup	Auto-detects ROCm/Vulkan	Manual ROCm config required	Manual ROCm config required
The catch	Best results on AMD-only hardware	No NPU offloading anywhere	Not a server, no API by default

Honest take: If you own a Ryzen AI 300-series laptop or a Strix Halo system, Lemonade is the obvious choice — nothing else gives you NPU-accelerated TTFT plus unified multi-modal inference in one install. On Nvidia hardware or older AMD CPUs, use Ollama instead.

The problem Lemonade solves

Every guide about running local LLMs on AMD hardware eventually arrives at the same frustrating detour: ROCm. The ROCm driver stack is powerful but installation is brittle, GPU target strings change between releases, and getting llama.cpp to actually use your Radeon RX 7900 XTX rather than falling back to CPU can eat an afternoon.

AMD's answer is Lemonade — an open-source local AI server (GitHub: lemonade-sdk/lemonade, 3.7k stars as of May 2026) sponsored and co-developed by AMD engineers. It auto-detects your hardware, selects the right backend, and exposes everything through a single OpenAI-compatible REST endpoint. No manual ROCm flags. No GPU target strings. One install.

The more interesting innovation, though, is the NPU. Modern Ryzen AI 300/400-series laptops ship with an XDNA 2 neural processing unit delivering 50 TOPS of AI compute — and until Lemonade, that hardware sat mostly idle for LLM inference.

How the GPU + NPU split works

LLM inference has two distinct phases with very different compute profiles:

Prefill (prompt processing): The model ingests your entire input prompt and builds the KV cache. This is compute-bound — it needs raw matrix multiply throughput, not memory bandwidth. A prompt of 1,000 tokens needs thousands of matrix operations processed in parallel. The NPU excels here.

Decode (token generation): The model generates one token at a time. Each step needs to load the entire model's weights from memory to perform a single forward pass. This is memory-bandwidth-bound — sustained throughput depends on how fast weights can be read. The GPU wins here because it has wider memory buses.

Lemonade's hybrid execution exploits this split. On Ryzen AI 300/400-series hardware, it routes prompt processing through the XDNA 2 NPU and token generation through the integrated GPU (or a discrete Radeon if you have one). AMD's own benchmarks show the NPU delivers 2.3× faster time-to-first-token versus GPU-only inference, while GPU decode achieves 2.4× higher sustained throughput versus NPU-only decode. The hybrid mode combines both: fast startup from the NPU, sustained throughput from the GPU.

The backend doing the NPU work is FastFlowLM, a purpose-built runtime for AMD NPUs. Under the hood, Lemonade also orchestrates llama.cpp (for GGUF models on CPU/GPU via Vulkan or ROCm), whisper.cpp (speech-to-text), stable-diffusion.cpp (image generation), and Kokoro (text-to-speech). You don't configure any of this — it picks the backend based on what your hardware supports and what model format you're loading.

Hardware requirements

Full NPU + GPU hybrid: Ryzen AI 300/400-series (Strix Point)

The minimum hardware for NPU acceleration is a Ryzen AI 9 HX 370 or any other Ryzen AI 300-series chip (Strix Point). These APUs pack:

XDNA 2 NPU: 50 TOPS AI compute
Zen 5 CPU cores (up to 12 cores, 24 threads)
RDNA 3.5 iGPU (up to 16 CUs)
LPDDR5X system RAM (up to 32GB on typical laptop configs)

Windows 11 is required for NPU acceleration on these chips. Windows 10 is supported for CPU/GPU inference only.

On Linux, NPU support requires XDNA 2 specifically — the older XDNA 1 found in Ryzen AI 7000/8000/200-series chips is not supported for NPU inference via FastFlowLM on Linux. If you're on a Ryzen AI 7040 series (Hawk Point) or similar XDNA 1 hardware, you can still run Lemonade with GPU or CPU backends.

Maximum configuration: Ryzen AI MAX+ 395 (Strix Halo)

The Ryzen AI Max+ 395 (Strix Halo) is the standout platform for Lemonade in 2026:

XDNA 2 NPU: 50 TOPS
RDNA 3.5 iGPU: 40 compute units, 60 FP16 TFLOPS
Up to 128GB LPDDR5X unified memory (256-bit interface at 8,000 MT/s)
Up to 96GB of that pool usable as VRAM

The 128GB unified memory ceiling means Strix Halo can run dense 70B+ models entirely in-memory. Community benchmarks show impressive numbers on this hardware: Qwen3-Coder-Next at 43 t/s (Q4), Qwen3.5 35B-A3B at 55 t/s (Q4), and even GPT-OSS 120B reaching ~50 t/s. Dense 27B models are slower — Qwen3.5 27B lands at 11–12 t/s at Q4, the bandwidth cost of a fully-dense architecture at that size.

For context, an RTX 4090 achieves roughly 50–80 t/s on 7B models at Q4 with Ollama — competitive with Strix Halo at smaller scales, but the RTX 4090 tops out at 24GB VRAM with no path to 70B inference without CPU offloading.

Discrete Radeon GPUs

If you have a desktop with an AMD Radeon RX 7900 XTX or similar RDNA2/RDNA3/RDNA4 card, Lemonade supports GPU inference via ROCm or Vulkan. You won't get NPU acceleration — there's no NPU in a discrete Radeon — but you do get Lemonade's automatic backend selection, multi-modal stack, and unified API without manually configuring ROCm yourself.

Supported discrete GPU families: Radeon RX 6000 series (RDNA2), RX 7000 series (RDNA3), and RX 9000 series (RDNA4). The RX 9070 XT is the current value target for RDNA4 on desktop.

CPU fallback

No AMD GPU at all? Lemonade runs via llama.cpp CPU inference on any x86_64 machine. Performance is unsurprising — a Ryzen 9 7950X at Q4_K_M gets roughly 5–8 t/s on 7B models — but the setup path and API remain identical. Useful for testing or for workflows where latency doesn't matter.

Installation

Windows

Download the one-click installer from the Lemonade releases page. The installer detects your hardware, pulls the right backends, and registers Lemonade as a Windows service. After install, the server is live at http://localhost:13305/v1.

Requirements: Windows 10 (build 1809+) for CPU/GPU, Windows 11 for NPU on Ryzen AI 300/400.

Model downloads happen through the Lemonade UI or via API — no manual GGUF hunting required. Models are stored locally; they don't leave your machine.

Linux

With Lemonade 10.0.1, Debian packages are available via a PPA for Ubuntu-based distributions. Install ROCm drivers first if you have a Radeon GPU; the ROCm setup is still a prerequisite on Linux, but Lemonade handles everything above that layer.

For NPU on Linux: FastFlowLM requires XDNA 2 (Ryzen AI 300/400/Max series). The packages provide an improved setup process that Phoronix covered as a significant usability improvement over earlier versions.

macOS and Docker

Lemonade runs on macOS via CPU inference (no Metal/NPU backend as of v10.3). Docker images are available for containerized d

DEV Community