If you are building AI-powered apps and feeling the cost of cloud API bills — or the anxiety of sending user data off-device — Lemonade is worth your time.
Lemonade is an open-source local AI server (3.7k stars, sponsored by AMD) that runs LLMs, image generation, speech-to-text, and text-to-speech entirely on your own hardware. It exposes a standard OpenAI-compatible API, so switching from a cloud provider means changing one URL. The project just shipped v10.3, its biggest release yet.
## What is Lemonade?

Lemonade installs as a system service and manages everything: model downloads, backend selection, and a unified REST endpoint at `http://localhost:13305/v1`. Under the hood it wires together proven inference engines:
- llama.cpp for GGUF LLMs (Vulkan, ROCm, CPU, Metal)
- OnnxRuntime GenAI / FastFlowLM for NPU-accelerated FLM models
- whisper.cpp for speech-to-text
- stable-diffusion.cpp for image generation
- Kokoro for text-to-speech
Apps like n8n, VS Code GitHub Copilot, Open WebUI, Continue, OpenHands, and Dify already integrate with it out of the box via the standard OpenAI API.
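Once the server is running, you can sanity-check that unified endpoint from a few lines of stdlib Python. This is a sketch that assumes the `/models` response follows the standard OpenAI list shape (`{"data": [{"id": ...}]}`):

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_local_models(base="http://localhost:13305/v1"):
    """Query the OpenAI-compatible /models endpoint.

    Returns the model IDs Lemonade has available, or None if the
    server is not running at the given base URL.
    """
    try:
        with urlopen(f"{base}/models", timeout=2) as resp:
            data = json.load(resp)
    except (URLError, OSError):
        return None
    return [m["id"] for m in data.get("data", [])]

print(list_local_models())  # None until the server is up
```

The same check works against any OpenAI-compatible server, which is the whole point: your application code does not need to know whether it is talking to a cloud provider or your own machine.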
## What's new in v10.3 (latest release)
v10.3 is a landmark release with three headline changes.
Desktop app is now 10x smaller. The app migrated from Electron to Tauri, a Rust-based cross-platform framework that uses the system's native webview instead of bundling Chromium. macOS and Windows binaries dropped from ~101-107 MB to ~7-9 MB.
OmniRouter for true omni-modal chat. The new OmniRouter unifies all backends — text, image, speech, vision — into a single OpenAI-compatible endpoint. You can interact with these modalities as tools in an agentic loop, making natural-language requests like "generate an image of X and then describe it" without gluing separate API calls together.
ROCm 7 support with multiple channels. ROCm 7.2 stable, 7.12 preview, and TheRock nightly builds are all supported. The 7.12 preview is now the default.
Other notable changes in v10.3:
- Light mode theme added to the GUI
- Easy llama.cpp version pinning and auto-update
- AppImage removed for Linux; use the web app or Snap instead
- `lemonade-server-minimal.msi` deprecated (will be removed in a future release)
- `amd_igpu` and `amd_dgpu` consolidated to `amd_gpu` in the system-info endpoint
- `nvidia_dgpu` renamed to `nvidia_gpu` for consistency
## Recent release history at a glance
| Version | Key headline |
|---|---|
| v10.3 (latest) | OmniRouter, Tauri app (10x smaller), ROCm 7 |
| v10.2 | Embeddable Lemonade binary, Qwen Image models, OpenCode integration |
| v10.1 | Gemma 4 on GPU, super-resolution (Real-ESRGAN), new lemonade CLI, default port changed to 13305 |
| v10.0 | Claude Code integration, Fedora RPM installer, NPU on Linux, FLM multi-modal |
| v9.4 | Qwen 3.5 on ROCm/Vulkan, redesigned app, image editing endpoint |
## Install

### Windows

Download `lemonade.msi` from the releases page. This installs both the server and the Tauri desktop app.

### Ubuntu / Debian (PPA)

```shell
sudo add-apt-repository ppa:lemonade-team/stable
sudo apt install lemonade-server
```

### Snap

```shell
sudo snap install lemonade-server
```

### macOS (beta)

Download the `.pkg` from the releases page.

### Docker

```shell
docker run -p 13305:13305 lemonade-sdk/lemonade-server
```

### RPM (Fedora)

```shell
# Download the .rpm from the releases page
sudo rpm -i lemonade-server-10.3.0.x86_64.rpm
```
## Your first model in 3 commands

Note: as of v10.1, the CLI command is `lemonade` (the old `lemonade-server` CLI is deprecated).

```shell
# See which backends your hardware supports
lemonade recipes

# Download a model
lemonade pull Gemma-3-4b-it-GGUF

# Run it
lemonade run Gemma-3-4b-it-GGUF
```

Other modalities work the same way:

```shell
# Image generation
lemonade run SDXL-Turbo

# Text-to-speech
lemonade run kokoro-v1

# Speech-to-text
lemonade run Whisper-Large-v3-Turbo
```
## Integrating with your app

Because Lemonade exposes an OpenAI-compatible API, you swap it in with a single config change. The base URL as of v10.1 is `http://localhost:13305/v1`.

### Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="lemonade",  # required by the library, but unused by Lemonade
)

response = client.chat.completions.create(
    model="Gemma-3-4b-it-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of running AI locally."},
    ],
)

print(response.choices[0].message.content)
```
### Node.js

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:13305/v1",
  apiKey: "lemonade",
});

const response = await client.chat.completions.create({
  model: "Gemma-3-4b-it-GGUF",
  messages: [{ role: "user", content: "Hello from Node.js" }],
});

console.log(response.choices[0].message.content);
```
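Streaming works through the same surface: pass `stream=True` and chunks arrive as OpenAI-style server-sent events. As a sketch of what the client libraries handle for you, here is a minimal stdlib parser for that format (assuming Lemonade's chunks match the standard OpenAI `data:` delta shape):

```python
import json

def parse_sse_chunks(raw: str) -> str:
    """Reassemble content deltas from an OpenAI-style SSE stream."""
    pieces = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and comments between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(data)["choices"][0]["delta"]
        if "content" in delta:
            pieces.append(delta["content"])
    return "".join(pieces)

# A stream as an OpenAI-compatible server would send it:
raw = (
    'data: {"choices":[{"delta":{"content":"Hello"}}]}\n'
    'data: {"choices":[{"delta":{"content":" world"}}]}\n'
    'data: [DONE]\n'
)
print(parse_sse_chunks(raw))  # Hello world
```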
Lemonade also supports the Ollama API and the Anthropic API for apps that use those clients natively.
## Supported hardware
| Hardware | Backend | Notes |
|---|---|---|
| AMD Radeon RDNA3 / RDNA4 GPU | ROCm | RX 7000/9000 series, Radeon PRO |
| Ryzen AI MAX (Strix Halo) | ROCm (gfx1151) | Windows and Ubuntu |
| Any Vulkan-capable GPU | Vulkan (llamacpp) | Broad compatibility |
| AMD Ryzen AI NPU (XDNA2) | FLM / FastFlowLM | Windows and Linux (beta) |
| Any x86_64 CPU | CPU | Universally available, slower |
| Apple Silicon | Metal | macOS beta |
Not sure what your machine supports? Run:

```shell
lemonade recipes
```

It auto-detects your hardware and lists exactly which backends are available.
## Embeddable Lemonade
Since v10.2, you can bundle Lemonade as a portable binary inside your own application. Your users get local multi-modal AI without ever seeing a Lemonade installer or any Lemonade branding.
```shell
# Run lemond as a subprocess from your app
lemond ./
```
Full guide: lemonade-server.ai/docs/embeddable/
This is particularly useful for desktop app developers who want to ship local AI features without taking on the complexity of packaging inference engines themselves.
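From the host application's side, this is ordinary process management. A minimal Python sketch of the pattern, assuming the `lemond ./` invocation shown earlier (flags and lifecycle details live in the embeddable guide):

```python
import shutil
import subprocess

def launch_embedded(binary: str = "lemond", workdir: str = "./"):
    """Start the bundled Lemonade binary as a child process.

    Returns the Popen handle, or None when the binary is not on PATH
    (e.g. during development before it is shipped with the app).
    """
    path = shutil.which(binary)
    if path is None:
        return None
    return subprocess.Popen([path, workdir])

proc = launch_embedded()
if proc is None:
    print("lemond not found; bundle it alongside your application binary")
```

Your app would then talk to the child process over the same OpenAI-compatible HTTP API as any other client, and terminate it on shutdown.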
## App integrations
Lemonade has a growing marketplace of first-class integrations. Highlights include:
- VS Code GitHub Copilot — use local models for code completions
- Claude Code — `lemonade launch claude` wires it up natively (added in v10.0)
- Open WebUI — a polished ChatGPT-style UI, running locally
- Continue — local AI coding assistant for VS Code and JetBrains
- n8n — automate workflows with local AI nodes
- OpenHands — local AI agent for software engineering tasks
- Dify — LLM app building platform
- AnythingLLM — local knowledge base with RAG
## OmniRouter: the key new API concept in v10.3
OmniRouter is worth calling out separately because it changes how you think about the API surface. Previously you would call separate endpoints for text, image, and speech. With OmniRouter, you interact through a single multi-modal endpoint and Lemonade routes each request to the correct backend engine automatically.
This means you can build agentic pipelines — for example, a loop that generates text, converts it to speech, and produces an image — all through one unified client without managing multiple base URLs or backend configurations.
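The shape of such a pipeline is easy to sketch with stubs. Below, each modality is registered as a "tool" that a simple loop dispatches to; the tool names and loop are purely illustrative (not Lemonade's actual API), and in a real setup each stub body would be one call against the same `/v1` endpoint:

```python
# Toy agentic loop: one dispatcher, each modality registered as a tool.
# With OmniRouter, every tool body would hit the same unified endpoint.
def generate_image(prompt: str) -> str:
    return f"<image: {prompt}>"          # stub for the image backend

def describe_image(image: str) -> str:
    return f"A picture showing {image}"  # stub for the vision backend

TOOLS = {"generate_image": generate_image, "describe_image": describe_image}

def run_plan(plan):
    """Execute (tool, input) steps; None means 'use the previous output'."""
    result = None
    for tool, arg in plan:
        result = TOOLS[tool](arg if arg is not None else result)
    return result

# "Generate an image of a lemon tree and then describe it"
print(run_plan([("generate_image", "a lemon tree"), ("describe_image", None)]))
```

The value of OmniRouter is that the routing table (`TOOLS` here) lives server-side: your client stays a single OpenAI-compatible connection while the server picks the right engine per step.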
## How it compares to Ollama and LM Studio
These tools all solve similar problems. Where Lemonade stands out:
- NPU support — one of the very few tools that accelerates inference on the AMD XDNA2 NPU in Ryzen AI laptops
- True multi-modal in one server — text, images, speech-to-text, and TTS from a single API endpoint
- OmniRouter — automatic multi-modal routing without manual backend wiring
- Embeddable binary — package it inside your own app with no Lemonade branding
- Multiple API standards — OpenAI, Anthropic, and Ollama APIs simultaneously
- AMD-first optimizations — deep ROCm integration and NPU tooling maintained by AMD engineers
If you are on a mainstream NVIDIA GPU and only need text generation, Ollama is slightly simpler to get started with. For AMD hardware, AI PCs with NPUs, or multi-modal workloads, Lemonade covers more ground.
## Quick reference
| Resource | Link |
|---|---|
| GitHub | github.com/lemonade-sdk/lemonade |