This article was originally published on aifoss.dev
TL;DR: Lemonade v10.6 is AMD's open-source LLM server that adds NPU prefill acceleration, image gen, and speech to one OpenAI-compatible endpoint. NPU acceleration works only on Ryzen AI 300/400 chips — on other hardware, Ollama's ecosystem is wider. AMD Ryzen AI users should pick Lemonade; everyone else should consider Ollama first.
| | Lemonade v10.6 | Ollama v0.6 | LocalAI |
|---|---|---|---|
| Best for | AMD GPU + NPU hybrid, multi-modal | Cross-platform, broadest ecosystem | OpenAI API proxy, any hardware |
| Install | winget or Snap | curl one-liner | Docker Compose |
| Hardware | AMD RDNA3+, NVIDIA, Apple M, CPU | Any GPU | Any hardware |
| Model formats | GGUF, ONNX, FLM, SafeTensors | GGUF (Ollama manifest) | GGUF, OpenVINO, more |
| Multi-modal | LLM + image gen + Whisper + TTS | LLM + vision models | LLM + Whisper + SD |
| The catch | NPU only on Ryzen AI 300/400 | No NPU acceleration | High setup complexity |
Honest take: On a Ryzen AI 300-series machine, Lemonade is the better daily driver, because it uses hardware that Ollama leaves idle and bundles image gen plus speech in one package. On NVIDIA hardware, or wherever you need maximum integration coverage, stick with Ollama.
What Lemonade Is and Why AMD Built It
Ollama solved cross-platform local LLM deployment cleanly. But it left AMD NPU owners with idle hardware — the dedicated AI accelerators in Ryzen AI chips sat unused because Ollama has no FastFlowLM backend.
Lemonade is AMD's answer. Released under Apache 2.0 and available at github.com/lemonade-sdk/lemonade, it bundles:
- An OpenAI-compatible HTTP API at http://localhost:13305/v1
- llama.cpp with Vulkan backend for AMD and NVIDIA GPUs
- FastFlowLM for XDNA2 NPU acceleration on Ryzen AI chips
- Stable Diffusion image generation
- Whisper speech-to-text
- Kokoro text-to-speech
- A model manager with one-command downloads from Hugging Face
The core design difference from Ollama is hardware-tier splitting. On a Ryzen AI 300-series chip, prompt processing (prefill) goes to the NPU while token generation (decode) goes to the iGPU. This is not marketing — the NPU has better compute throughput for dense matrix math during prefill, and the iGPU has better memory bandwidth for sequential token generation. The result is lower Time to First Token on long system prompts and agentic chains.
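A rough way to reason about the effect on Time to First Token: prefill has to consume the entire prompt before the first token appears, so TTFT scales with prompt length divided by prefill throughput. The sketch below is only that arithmetic. The throughput numbers are made-up placeholders, not measurements of any NPU or iGPU.

```python
def time_to_first_token(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Rough TTFT estimate in seconds: the whole prompt is processed
    before the first output token can be sampled."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill throughput must be positive")
    return prompt_tokens / prefill_tok_per_s


def total_latency(prompt_tokens: int, output_tokens: int,
                  prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """TTFT plus the time to stream the reply at the decode rate."""
    ttft = time_to_first_token(prompt_tokens, prefill_tok_per_s)
    return ttft + output_tokens / decode_tok_per_s


# Hypothetical rates, chosen only to keep the arithmetic easy to follow.
slow_prefill = time_to_first_token(8000, 400.0)   # 20.0 seconds
fast_prefill = time_to_first_token(8000, 1600.0)  # 5.0 seconds
```

With these placeholder rates, a 4x faster prefill stage cuts the wait on an 8,000-token prompt from 20 s to 5 s, while the decode rate only affects how quickly the rest of the reply streams.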
Current version: v10.6.0 (released May 21, 2026). Linux NPU support shipped with Lemonade 10.0 in March 2026 via the FastFlowLM runtime.
Hardware Compatibility
| Platform | Backend | Notes |
|---|---|---|
| AMD Ryzen AI 300/400 (XDNA2) | FastFlowLM NPU + Vulkan iGPU | Strix Halo supports up to 128 GB unified memory |
| AMD Radeon discrete (RDNA2/3/4) | llama.cpp + Vulkan | Standard VRAM limits; add 2–4 GB overhead |
| NVIDIA (Turing–Blackwell) | llama.cpp + Vulkan or CUDA | CUDA backend available in v10 and later |
| Apple Silicon (M1–M4) | Metal via llama.cpp | Unified memory; M4 Max is competitive on large models |
| x86_64 CPU | llama.cpp CPU | Small models only; no hardware acceleration |
NPU acceleration requires Ryzen AI 300-series or 400-series specifically — the XDNA2 architecture. Earlier Ryzen AI chips (7000, 8000, 200-series) have NPUs that no current runtime supports for LLM inference. On those systems, Lemonade falls back to Vulkan on the GPU, which is functionally the same as running Ollama.
Supported Linux distros: Ubuntu 24.04+, Fedora 43+, Debian Trixie+, Arch. Docker and Snap packages are available. For hardware context on AMD GPU builds, see runaihome.com for current RDNA4 GPU benchmarks and build guides.
Installation
Windows
winget install AMD.LemonadeServer
This installs the server and a Tauri desktop app (system-tray GUI for model downloads and server management). Alternatively, grab the .msi from the GitHub releases page. After install, the server starts automatically on port 13305.
Linux (Ubuntu 24.04+)
# Snap — works across Ubuntu 24.04+, Fedora 43+, Arch
sudo snap install lemonade
# Docker
docker run -d --gpus all -p 13305:13305 lemonadesdk/lemonade:latest
For NPU support on Linux, you need the XDNA driver and FastFlowLM runtime installed separately — the Lemonade docs cover the dependency chain. It is more involved than the Windows path. For most Linux users without a Ryzen AI 300/400 chip, the Snap install with Vulkan fallback is the practical path.
Verify the server is running
curl http://localhost:13305/v1/models
Expected output on a fresh install with no models downloaded:
{"object":"list","data":[]}
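The same check can be scripted. This is a minimal sketch using only the Python standard library, assuming Lemonade follows the OpenAI list-models shape in which each entry under `data` carries an `id` field. The function names are illustrative, not part of any Lemonade client.

```python
import json
from urllib.request import urlopen


def model_ids(models_response: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:13305/v1",
                    timeout: float = 5.0) -> list[str]:
    """Query a running server; raises URLError if nothing is listening."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(resp.read().decode("utf-8"))


# The fresh-install body shown above parses to an empty list.
fresh_install = '{"object":"list","data":[]}'
print(model_ids(fresh_install))  # prints []
```

Calling `fetch_model_ids()` after a `lemonade run` download should then return a non-empty list if the server is up.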
Running Your First Model
lemonade run Gemma-4-E2B-it-GGUF
This pulls the model from Hugging Face (if not cached) and starts a chat session in your terminal. The model manager uses Hugging Face slug format — you can also import any custom GGUF or ONNX model from Hugging Face directly.
Check which backend Lemonade selected for your hardware:
curl http://localhost:13305/stats
The response includes the active inference engine: vulkan, fastflowlm, rocm, or cpu. If you expected fastflowlm and got vulkan, check that your XDNA driver is installed and you're on a Ryzen AI 300/400 chip.
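If you want to automate that check, one cautious approach is to search the parsed response for any of the four engine names rather than rely on a specific field. The `/stats` schema is not shown here, so the sample body below is hypothetical and the key names in it are invented for illustration.

```python
import json

KNOWN_ENGINES = {"vulkan", "fastflowlm", "rocm", "cpu"}


def find_engine(node):
    """Depth-first search for the first string value naming a known engine."""
    if isinstance(node, str):
        name = node.lower()
        return name if name in KNOWN_ENGINES else None
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # numbers, booleans, null
    for child in children:
        found = find_engine(child)
        if found:
            return found
    return None


# Hypothetical response body; the real /stats fields may differ.
sample = json.loads('{"tokens_per_second": 31.2, "backend": {"engine": "Vulkan"}}')
print(find_engine(sample))  # prints vulkan
```

Searching values instead of a fixed key keeps the script working if the field is renamed, at the cost of possibly matching an unrelated string that happens to equal an engine name.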
To test image generation:
curl http://localhost:13305/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "a terminal screen in a dark room", "n": 1}'
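To turn the response into a file, the sketch below assumes the server answers in the OpenAI images format with a base64 payload under `data[0].b64_json`. Whether Lemonade returns `b64_json` or a URL by default is not confirmed here, so treat the field name as an assumption and inspect the raw response first.

```python
import base64
import json


def first_image_bytes(response_body: str) -> bytes:
    """Decode the first base64-encoded image from an OpenAI-style images response."""
    payload = json.loads(response_body)
    items = payload.get("data", [])
    if not items or "b64_json" not in items[0]:
        raise ValueError("no base64 image found in response")
    return base64.b64decode(items[0]["b64_json"])


# Stand-in response: the payload is placeholder bytes, not a real image.
fake_body = json.dumps(
    {"data": [{"b64_json": base64.b64encode(b"not-a-real-png").decode("ascii")}]}
)
image = first_image_bytes(fake_body)
```

Writing `image` to disk with `open("out.png", "wb")` would produce the file when the body comes from a real generation request.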
NPU + GPU Hybrid: Numbers From Real Hardware
On a Ryzen AI Max+ 395 (Strix Halo, 128 GB unified memory), the NPU handles prompt processing and the iGPU handles decode. Community benchmarks from May–June 2026 on this configuration:
| Model | Quantization | Tokens/sec |
|---|---|---|
| GPT-OSS 120B | Q4_K_M | ~50 tok/s |
| Qwen3.5-122B | Q4 | ~35 tok/s |
| Qwen3-Coder-Next | Q4 | ~43 tok/s |
For comparison: an RTX 4090 running llama.cpp hits 50–80 tok/s on 7B models at Q4 but stalls on 70B+ without aggressive quantization (limited by 24 GB VRAM). The Strix Halo runs 120B at full Q4_K_M in 128 GB of unified memory — a different tier of capability.
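The memory argument can be checked with simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below assumes an average of 4.5 bits per weight for a 4-bit quantization (real formats vary by tensor) and borrows the upper end of the 2–4 GB overhead figure from the compatibility table. It ignores KV cache growth at long context.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if the weights plus a fixed overhead allowance fit in memory_gb."""
    needed = weight_footprint_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= memory_gb


BITS_PER_WEIGHT = 4.5  # assumed average for 4-bit quantization, not an exact figure

print(weight_footprint_gb(120, BITS_PER_WEIGHT))  # 67.5
print(fits(120, BITS_PER_WEIGHT, 128))            # True
print(fits(70, BITS_PER_WEIGHT, 24))              # False
print(fits(7, BITS_PER_WEIGHT, 24))               # True
```

Under these assumptions a 120B model needs about 68 GB for weights, which fits comfortably in 128 GB of unified memory but is far beyond a 24 GB card.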
On smaller Ryzen AI 300 systems (Strix Point, 32–64 GB), expect:
- Llama 3.2 3B on NPU: ~28 tok/s at under 2 W
- Models above 8B: fall back to iGPU via Vulkan
FastFlowLM 0.9.35, the current NPU runtime bundled in Lemonade 10.6, supports context windows up to 256k tokens on XDNA2 NPUs.
Multi-Modal in One Server
Lemonade bundles three additional inference backends behind the same API port:
Image generation: SDXL-Turbo via /v1/images/generations. Any client that supports the OpenAI image endpoint works — including the ComfyUI API adapter. See our ComfyUI API tutorial for chaining this into automated pipelines.
Speech-to-text: Whisper backend via /v1/audio/transcriptions. Uses the same model weights as whisper.cpp.
Text-to-speech: Kokoro TTS via /v1/audio/speech. Known limitation as of v10.6: voices not in the pre-configured list produce silent audio. Custom voice loading is not yet supported.
Running these three modalities as separate services (Ollama + ComfyUI + a Whisper server) adds coordination overhead: three processes, three ports, three model caches. Lemonade consolidates them into one service with one model manager. For a home server running all three, that means less to configure, monitor, and keep updated.
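From a client's point of view, consolidation means one base URL and one request helper for every JSON route. The sketch below builds (without sending) requests for two of the endpoints named above. The request bodies follow the OpenAI API shapes; the `model` and `voice` values are placeholders, not confirmed Lemonade identifiers.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:13305/v1"


def json_post(path: str, payload: dict, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a JSON POST request for an OpenAI-style route."""
    return Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


image_req = json_post(
    "/images/generations",
    {"prompt": "a terminal screen in a dark room", "n": 1},
)
# "kokoro" and "example-voice" are placeholder names for illustration only.
speech_req = json_post(
    "/audio/speech",
    {"model": "kokoro", "input": "Hello from one port.", "voice": "example-voice"},
)
```

Passing either object to `urllib.request.urlopen` sends it. Transcription is left out because the OpenAI-style /v1/audio/transcriptions route expects a multipart file upload rather than a JSON body.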
Connecting to Open WebUI
Open WebUI supports custom OpenAI-compatible endpoints. To add Lemonade:
- Open WebUI settings → Connections → Add Connection
- API URL: http://localhost:13305/v1
- API key: leave blank (Lemonade does not validate keys)
- Save and confirm models appear in the model list
If you're running Open WebUI in Docker and Lemonade natively on the host:
http://host.docker.