DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at aifoss.dev

AMD Lemonade Review 2026: GPU, NPU, and Multi-Modal

This article was originally published on aifoss.dev

TL;DR: Lemonade v10.6 is AMD's open-source LLM server that adds NPU prefill acceleration, image gen, and speech to one OpenAI-compatible endpoint. NPU acceleration works only on Ryzen AI 300/400 chips — on other hardware, Ollama's ecosystem is wider. AMD Ryzen AI users should pick Lemonade; everyone else should consider Ollama first.

Lemonade v10.6 Ollama v0.6 LocalAI
Best for AMD GPU + NPU hybrid, multi-modal Cross-platform, broadest ecosystem OpenAI API proxy, any hardware
Install winget or Snap curl one-liner Docker Compose
Hardware AMD RDNA3+, NVIDIA, Apple M, CPU Any GPU Any hardware
Model formats GGUF, ONNX, FLM, SafeTensors GGUF (Ollama manifest) GGUF, OpenVINO, more
Multi-modal LLM + image gen + Whisper + TTS LLM + vision models LLM + Whisper + SD
The catch NPU only on Ryzen AI 300/400 No NPU acceleration High setup complexity

Honest take: On a Ryzen AI 300-series machine, Lemonade is the better daily driver — it uses hardware that Ollama leaves idle and bundles image gen plus speech in one package. On Nvidia hardware or wherever you need maximum integration coverage, stick with Ollama.


What Lemonade Is and Why AMD Built It

Ollama solved cross-platform local LLM deployment cleanly. But it left AMD NPU owners with idle hardware — the dedicated AI accelerators in Ryzen AI chips sat unused because Ollama has no FastFlowLM backend.

Lemonade is AMD's answer. Released under Apache 2.0 and available at github.com/lemonade-sdk/lemonade, it bundles:

  • An OpenAI-compatible HTTP API at http://localhost:13305/v1
  • llama.cpp with Vulkan backend for AMD and NVIDIA GPUs
  • FastFlowLM for XDNA2 NPU acceleration on Ryzen AI chips
  • Stable Diffusion image generation
  • Whisper speech-to-text
  • Kokoro text-to-speech
  • A model manager with one-command downloads from Hugging Face

The core design difference from Ollama is hardware-tier splitting. On a Ryzen AI 300-series chip, prompt processing (prefill) goes to the NPU while token generation (decode) goes to the iGPU. This is not marketing — the NPU has better compute throughput for dense matrix math during prefill, and the iGPU has better memory bandwidth for sequential token generation. The result is lower Time to First Token on long system prompts and agentic chains.

Current version: v10.6.0 (released May 21, 2026). Linux NPU support shipped with Lemonade 10.0 in March 2026 via the FastFlowLM runtime.


Hardware Compatibility

Platform Backend Notes
AMD Ryzen AI 300/400 (XDNA2) FastFlowLM NPU + Vulkan iGPU Strix Halo supports up to 128 GB unified memory
AMD Radeon discrete (RDNA2/3/4) llama.cpp + Vulkan Standard VRAM limits; add 2–4 GB overhead
NVIDIA (Turing–Blackwell) llama.cpp + Vulkan or CUDA CUDA backend available since v10+
Apple Silicon (M1–M4) Metal via llama.cpp Unified memory; M4 Max competitive at large models
x86_64 CPU llama.cpp CPU Small models only; no hardware acceleration

NPU acceleration requires Ryzen AI 300-series or 400-series specifically — the XDNA2 architecture. Earlier Ryzen AI chips (7000, 8000, 200-series) have NPUs that no current runtime supports for LLM inference. On those systems, Lemonade falls back to Vulkan on the GPU, which is functionally the same as running Ollama.

Supported Linux distros: Ubuntu 24.04+, Fedora 43+, Debian Trixie+, Arch. Docker and Snap packages are available. For hardware context on AMD GPU builds, see runaihome.com for current RDNA4 GPU benchmarks and build guides.


Installation

Windows

winget install AMD.LemonadeServer
Enter fullscreen mode Exit fullscreen mode

This installs the server and a Tauri desktop app (system-tray GUI for model downloads and server management). Alternatively, grab the .msi from the GitHub releases page. After install, the server starts automatically on port 13305.

Linux (Ubuntu 24.04+)

# Snap — works across Ubuntu 24.04+, Fedora 43+, Arch
sudo snap install lemonade

# Docker
docker run -d --gpus all -p 13305:13305 lemonadesdk/lemonade:latest
Enter fullscreen mode Exit fullscreen mode

For NPU support on Linux, you need the XDNA driver and FastFlowLM runtime installed separately — the Lemonade docs cover the dependency chain. It is more involved than the Windows path. For most Linux users without a Ryzen AI 300/400 chip, the Snap install with Vulkan fallback is the practical path.

Verify the server is running

curl http://localhost:13305/v1/models
Enter fullscreen mode Exit fullscreen mode

Expected output on a fresh install with no models downloaded:

{"object":"list","data":[]}
Enter fullscreen mode Exit fullscreen mode

Running Your First Model

lemonade run Gemma-4-E2B-it-GGUF
Enter fullscreen mode Exit fullscreen mode

This pulls the model from Hugging Face (if not cached) and starts a chat session in your terminal. The model manager uses Hugging Face slug format — you can also import any custom GGUF or ONNX model from Hugging Face directly.

Check which backend Lemonade selected for your hardware:

curl http://localhost:13305/stats
Enter fullscreen mode Exit fullscreen mode

The response includes the active inference engine: vulkan, fastflowlm, rocm, or cpu. If you expected fastflowlm and got vulkan, check that your XDNA driver is installed and you're on a Ryzen AI 300/400 chip.

To test image generation:

curl http://localhost:13305/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a terminal screen in a dark room", "n": 1}'
Enter fullscreen mode Exit fullscreen mode

NPU + GPU Hybrid: Numbers From Real Hardware

On a Ryzen AI Max+ 395 (Strix Halo, 128 GB unified memory), the NPU handles prompt processing and the iGPU handles decode. Community benchmarks from May–June 2026 on this configuration:

Model Quantization Tokens/sec
GPT-OSS 120B Q4_K_M ~50 tok/s
Qwen3.5-122B Q4 ~35 tok/s
Qwen3-Coder-Next Q4 ~43 tok/s

For comparison: an RTX 4090 running llama.cpp hits 50–80 tok/s on 7B models at Q4 but stalls on 70B+ without aggressive quantization (limited by 24 GB VRAM). The Strix Halo runs 120B at full Q4_K_M in 128 GB of unified memory — a different tier of capability.

On smaller Ryzen AI 300 systems (Strix Point, 32–64 GB), expect:

  • Llama 3.2-3B on NPU: ~28 tok/s at under 2 W
  • Models above 8B: fall back to iGPU via Vulkan

FastFlowLM 0.9.35, the current NPU runtime bundled in Lemonade 10.6, supports context windows up to 256k tokens on XDNA2 NPUs.


Multi-Modal in One Server

Lemonade bundles three additional inference backends behind the same API port:

Image generation: SDXL-Turbo via /v1/images/generations. Any client that supports the OpenAI image endpoint works — including the ComfyUI API adapter. See our ComfyUI API tutorial for chaining this into automated pipelines.

Speech-to-text: Whisper backend via /v1/audio/transcriptions. Uses the same model weights as whisper.cpp.

Text-to-speech: Kokoro TTS via /v1/audio/speech. Known limitation as of v10.6: voices not in the pre-configured list produce muted audio. Custom voice loading is not yet supported.

Running these three modalities as separate services (Ollama + ComfyUI + a Whisper server) adds coordination overhead — three processes, three ports, three model caches. Lemonade consolidates them into one service with one model manager. For a home server running all three, that's meaningful.


Connecting to Open WebUI

Open WebUI supports custom OpenAI-compatible endpoints. To add Lemonade:

  1. Open WebUI settings → ConnectionsAdd Connection
  2. API URL: http://localhost:13305/v1
  3. API key: leave blank (Lemonade does not validate keys)
  4. Save and confirm models appear in the model list

If you're running Open WebUI in Docker and Lemonade natively on the host:



http://host.docker.
Enter fullscreen mode Exit fullscreen mode

Top comments (0)