Jovan Chan

Posted on Jun 14 • Originally published at aifoss.dev

AMD Lemonade Review 2026: GPU, NPU, and Multi-Modal

#amd #llm #selfhosted #npu

This article was originally published on aifoss.dev

TL;DR: Lemonade v10.6 is AMD's open-source LLM server that adds NPU prefill acceleration, image gen, and speech to one OpenAI-compatible endpoint. NPU acceleration works only on Ryzen AI 300/400 chips — on other hardware, Ollama's ecosystem is wider. AMD Ryzen AI users should pick Lemonade; everyone else should consider Ollama first.

	Lemonade v10.6	Ollama v0.6	LocalAI
Best for	AMD GPU + NPU hybrid, multi-modal	Cross-platform, broadest ecosystem	OpenAI API proxy, any hardware
Install	`winget` or Snap	`curl` one-liner	Docker Compose
Hardware	AMD RDNA3+, NVIDIA, Apple M, CPU	Any GPU	Any hardware
Model formats	GGUF, ONNX, FLM, SafeTensors	GGUF (Ollama manifest)	GGUF, OpenVINO, more
Multi-modal	LLM + image gen + Whisper + TTS	LLM + vision models	LLM + Whisper + SD
The catch	NPU only on Ryzen AI 300/400	No NPU acceleration	High setup complexity

Honest take: On a Ryzen AI 300-series machine, Lemonade is the better daily driver — it uses hardware that Ollama leaves idle and bundles image gen plus speech in one package. On Nvidia hardware or wherever you need maximum integration coverage, stick with Ollama.

What Lemonade Is and Why AMD Built It

Ollama solved cross-platform local LLM deployment cleanly. But it left AMD NPU owners with idle hardware — the dedicated AI accelerators in Ryzen AI chips sat unused because Ollama has no FastFlowLM backend.

Lemonade is AMD's answer. Released under Apache 2.0 and available at github.com/lemonade-sdk/lemonade, it bundles:

An OpenAI-compatible HTTP API at http://localhost:13305/v1
llama.cpp with Vulkan backend for AMD and NVIDIA GPUs
FastFlowLM for XDNA2 NPU acceleration on Ryzen AI chips
Stable Diffusion image generation
Whisper speech-to-text
Kokoro text-to-speech
A model manager with one-command downloads from Hugging Face

The core design difference from Ollama is hardware-tier splitting. On a Ryzen AI 300-series chip, prompt processing (prefill) goes to the NPU while token generation (decode) goes to the iGPU. This is not marketing — the NPU has better compute throughput for dense matrix math during prefill, and the iGPU has better memory bandwidth for sequential token generation. The result is lower Time to First Token on long system prompts and agentic chains.

Current version: v10.6.0 (released May 21, 2026). Linux NPU support shipped with Lemonade 10.0 in March 2026 via the FastFlowLM runtime.

Hardware Compatibility

Platform	Backend	Notes
AMD Ryzen AI 300/400 (XDNA2)	FastFlowLM NPU + Vulkan iGPU	Strix Halo supports up to 128 GB unified memory
AMD Radeon discrete (RDNA2/3/4)	llama.cpp + Vulkan	Standard VRAM limits; add 2–4 GB overhead
NVIDIA (Turing–Blackwell)	llama.cpp + Vulkan or CUDA	CUDA backend available since v10+
Apple Silicon (M1–M4)	Metal via llama.cpp	Unified memory; M4 Max competitive at large models
x86_64 CPU	llama.cpp CPU	Small models only; no hardware acceleration

NPU acceleration requires Ryzen AI 300-series or 400-series specifically — the XDNA2 architecture. Earlier Ryzen AI chips (7000, 8000, 200-series) have NPUs that no current runtime supports for LLM inference. On those systems, Lemonade falls back to Vulkan on the GPU, which is functionally the same as running Ollama.

Supported Linux distros: Ubuntu 24.04+, Fedora 43+, Debian Trixie+, Arch. Docker and Snap packages are available. For hardware context on AMD GPU builds, see runaihome.com for current RDNA4 GPU benchmarks and build guides.

Installation

Windows

winget install AMD.LemonadeServer

This installs the server and a Tauri desktop app (system-tray GUI for model downloads and server management). Alternatively, grab the .msi from the GitHub releases page. After install, the server starts automatically on port 13305.

Linux (Ubuntu 24.04+)

# Snap — works across Ubuntu 24.04+, Fedora 43+, Arch
sudo snap install lemonade

# Docker
docker run -d --gpus all -p 13305:13305 lemonadesdk/lemonade:latest

For NPU support on Linux, you need the XDNA driver and FastFlowLM runtime installed separately — the Lemonade docs cover the dependency chain. It is more involved than the Windows path. For most Linux users without a Ryzen AI 300/400 chip, the Snap install with Vulkan fallback is the practical path.

Verify the server is running

curl http://localhost:13305/v1/models

Expected output on a fresh install with no models downloaded:

{"object":"list","data":[]}

Running Your First Model

lemonade run Gemma-4-E2B-it-GGUF

This pulls the model from Hugging Face (if not cached) and starts a chat session in your terminal. The model manager uses Hugging Face slug format — you can also import any custom GGUF or ONNX model from Hugging Face directly.

Check which backend Lemonade selected for your hardware:

curl http://localhost:13305/stats

The response includes the active inference engine: vulkan, fastflowlm, rocm, or cpu. If you expected fastflowlm and got vulkan, check that your XDNA driver is installed and you're on a Ryzen AI 300/400 chip.

To test image generation:

curl http://localhost:13305/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a terminal screen in a dark room", "n": 1}'

NPU + GPU Hybrid: Numbers From Real Hardware

On a Ryzen AI Max+ 395 (Strix Halo, 128 GB unified memory), the NPU handles prompt processing and the iGPU handles decode. Community benchmarks from May–June 2026 on this configuration:

Model	Quantization	Tokens/sec
GPT-OSS 120B	Q4_K_M	~50 tok/s
Qwen3.5-122B	Q4	~35 tok/s
Qwen3-Coder-Next	Q4	~43 tok/s

For comparison: an RTX 4090 running llama.cpp hits 50–80 tok/s on 7B models at Q4 but stalls on 70B+ without aggressive quantization (limited by 24 GB VRAM). The Strix Halo runs 120B at full Q4_K_M in 128 GB of unified memory — a different tier of capability.

On smaller Ryzen AI 300 systems (Strix Point, 32–64 GB), expect:

Llama 3.2-3B on NPU: ~28 tok/s at under 2 W
Models above 8B: fall back to iGPU via Vulkan

FastFlowLM 0.9.35, the current NPU runtime bundled in Lemonade 10.6, supports context windows up to 256k tokens on XDNA2 NPUs.

Multi-Modal in One Server

Lemonade bundles three additional inference backends behind the same API port:

Image generation: SDXL-Turbo via /v1/images/generations. Any client that supports the OpenAI image endpoint works — including the ComfyUI API adapter. See our ComfyUI API tutorial for chaining this into automated pipelines.

Speech-to-text: Whisper backend via /v1/audio/transcriptions. Uses the same model weights as whisper.cpp.

Text-to-speech: Kokoro TTS via /v1/audio/speech. Known limitation as of v10.6: voices not in the pre-configured list produce muted audio. Custom voice loading is not yet supported.

Running these three modalities as separate services (Ollama + ComfyUI + a Whisper server) adds coordination overhead — three processes, three ports, three model caches. Lemonade consolidates them into one service with one model manager. For a home server running all three, that's meaningful.

Connecting to Open WebUI

Open WebUI supports custom OpenAI-compatible endpoints. To add Lemonade:

Open WebUI settings → Connections → Add Connection
API URL: http://localhost:13305/v1
API key: leave blank (Lemonade does not validate keys)
Save and confirm models appear in the model list

If you're running Open WebUI in Docker and Lemonade natively on the host:



http://host.docker.

DEV Community