If you are building AI-powered apps and feeling the cost of cloud API bills — or the anxiety of sending user data off-device — Lemonade is worth your time.
Lemonade is an open-source local AI server (3.7k stars, sponsored by AMD) that runs LLMs, image generation, speech-to-text, and text-to-speech entirely on your own hardware. It exposes a standard OpenAI-compatible API, so switching from a cloud provider means changing one URL. The project just shipped v10.3, its biggest release yet.
## What is Lemonade?

Lemonade installs as a system service and manages everything: model downloads, backend selection, and a unified REST endpoint at `http://localhost:13305/v1`. Under the hood it wires together proven inference engines:
- llama.cpp for GGUF LLMs (Vulkan, ROCm, CPU, Metal)
- OnnxRuntime GenAI / FastFlowLM for NPU-accelerated FLM models
- whisper.cpp for speech-to-text
- stable-diffusion.cpp for image generation
- Kokoro for text-to-speech
Apps like n8n, VS Code GitHub Copilot, Open WebUI, Continue, OpenHands, and Dify already integrate with it out of the box via the standard OpenAI API.
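Once the server is running, you can sanity-check that unified endpoint from a few lines of stdlib Python. This is a sketch that assumes the `/models` response follows the standard OpenAI list shape (`{"data": [{"id": ...}]}`):

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_local_models(base="http://localhost:13305/v1"):
    """Query the OpenAI-compatible /models endpoint.

    Returns the model IDs Lemonade has available, or None if the
    server is not running at the given base URL.
    """
    try:
        with urlopen(f"{base}/models", timeout=2) as resp:
            data = json.load(resp)
    except (URLError, OSError):
        return None
    return [m["id"] for m in data.get("data", [])]

print(list_local_models())  # None until the server is up
```

The same check works against any OpenAI-compatible server, which is the whole point: your application code does not need to know whether it is talking to a cloud provider or your own machine.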
## What's new in v10.3 (latest release)
v10.3 is a landmark release with three headline changes.
Desktop app is now 10x smaller. The app migrated from Electron to Tauri, a Rust-based cross-platform framework that uses the system's native webview instead of bundling Chromium. macOS and Windows binaries dropped from ~101-107 MB to ~7-9 MB.
OmniRouter for true omni-modal chat. The new OmniRouter unifies all backends — text, image, speech, vision — into a single OpenAI-compatible endpoint. You can interact with these modalities as tools in an agentic loop, making natural-language requests like "generate an image of X and then describe it" without gluing separate API calls together.
ROCm 7 support with multiple channels. ROCm 7.2 stable, 7.12 preview, and TheRock nightly builds are all supported. The 7.12 preview is now the default.
Other notable changes in v10.3:
- Light mode theme added to the GUI
- Easy llama.cpp version pinning and auto-update
- AppImage removed for Linux; use the web app or Snap instead
- `lemonade-server-minimal.msi` deprecated (will be removed in a future release)
- `amd_igpu` and `amd_dgpu` consolidated to `amd_gpu` in the system-info endpoint
- `nvidia_dgpu` renamed to `nvidia_gpu` for consistency
## Recent release history at a glance
| Version | Key headline |
|---|---|
| v10.3 (latest) | OmniRouter, Tauri app (10x smaller), ROCm 7 |
| v10.2 | Embeddable Lemonade binary, Qwen Image models, OpenCode integration |
| v10.1 | Gemma 4 on GPU, super-resolution (Real-ESRGAN), new lemonade CLI, default port changed to 13305 |
| v10.0 | Claude Code integration, Fedora RPM installer, NPU on Linux, FLM multi-modal |
| v9.4 | Qwen 3.5 on ROCm/Vulkan, redesigned app, image editing endpoint |
## Install

### Windows

Download `lemonade.msi` from the releases page. This installs both the server and the Tauri desktop app.

### Ubuntu / Debian (PPA)

```shell
sudo add-apt-repository ppa:lemonade-team/stable
sudo apt install lemonade-server
```

### Snap

```shell
sudo snap install lemonade-server
```

### macOS (beta)

Download the `.pkg` from the releases page.

### Docker

```shell
docker run -p 13305:13305 lemonade-sdk/lemonade-server
```

### RPM (Fedora)

```shell
# Download the .rpm from the releases page
sudo rpm -i lemonade-server-10.3.0.x86_64.rpm
```
## Your first model in 3 commands

Note: as of v10.1, the CLI command is `lemonade` (the old `lemonade-server` CLI is deprecated).

```shell
# See which backends your hardware supports
lemonade recipes

# Download a model
lemonade pull Gemma-3-4b-it-GGUF

# Run it
lemonade run Gemma-3-4b-it-GGUF
```

Other modalities work the same way:

```shell
# Image generation
lemonade run SDXL-Turbo

# Text-to-speech
lemonade run kokoro-v1

# Speech-to-text
lemonade run Whisper-Large-v3-Turbo
```
## Integrating with your app

Because Lemonade exposes an OpenAI-compatible API, you swap it in with a single config change. The base URL as of v10.1 is `http://localhost:13305/v1`.

### Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="lemonade",  # required by the library, but unused by Lemonade
)

response = client.chat.completions.create(
    model="Gemma-3-4b-it-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of running AI locally."},
    ],
)

print(response.choices[0].message.content)
```
### Node.js

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:13305/v1",
  apiKey: "lemonade",
});

const response = await client.chat.completions.create({
  model: "Gemma-3-4b-it-GGUF",
  messages: [{ role: "user", content: "Hello from Node.js" }],
});

console.log(response.choices[0].message.content);
```
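Streaming works through the same surface: pass `stream=True` and chunks arrive as OpenAI-style server-sent events. As a sketch of what the client libraries handle for you, here is a minimal stdlib parser for that format (assuming Lemonade's chunks match the standard OpenAI `data:` delta shape):

```python
import json

def parse_sse_chunks(raw: str) -> str:
    """Reassemble content deltas from an OpenAI-style SSE stream."""
    pieces = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and comments between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(data)["choices"][0]["delta"]
        if "content" in delta:
            pieces.append(delta["content"])
    return "".join(pieces)

# A stream as an OpenAI-compatible server would send it:
raw = (
    'data: {"choices":[{"delta":{"content":"Hello"}}]}\n'
    'data: {"choices":[{"delta":{"content":" world"}}]}\n'
    'data: [DONE]\n'
)
print(parse_sse_chunks(raw))  # Hello world
```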
Lemonade also supports the Ollama API and the Anthropic API for apps that use those clients natively.
## Supported hardware
| Hardware | Backend | Notes |
|---|---|---|
| AMD Radeon RDNA3 / RDNA4 GPU | ROCm | RX 7000/9000 series, Radeon PRO |
| Ryzen AI MAX (Strix Halo) | ROCm (gfx1151) | Windows and Ubuntu |
| Any Vulkan-capable GPU | Vulkan (llamacpp) | Broad compatibility |
| AMD Ryzen AI NPU (XDNA2) | FLM / FastFlowLM | Windows and Linux (beta) |
| Any x86_64 CPU | CPU | Universally available, slower |
| Apple Silicon | Metal | macOS beta |
Not sure what your machine supports? Run:

```shell
lemonade recipes
```

It auto-detects your hardware and lists exactly which backends are available.
## Embeddable Lemonade
Since v10.2, you can bundle Lemonade as a portable binary inside your own application. Your users get local multi-modal AI without ever seeing a Lemonade installer or any Lemonade branding.
```shell
# Run lemond as a subprocess from your app
lemond ./
```
Full guide: lemonade-server.ai/docs/embeddable/
This is particularly useful for desktop app developers who want to ship local AI features without taking on the complexity of packaging inference engines themselves.
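From the host application's side, this is ordinary process management. A minimal Python sketch of the pattern, assuming the `lemond ./` invocation shown earlier (flags and lifecycle details live in the embeddable guide):

```python
import shutil
import subprocess

def launch_embedded(binary: str = "lemond", workdir: str = "./"):
    """Start the bundled Lemonade binary as a child process.

    Returns the Popen handle, or None when the binary is not on PATH
    (e.g. during development before it is shipped with the app).
    """
    path = shutil.which(binary)
    if path is None:
        return None
    return subprocess.Popen([path, workdir])

proc = launch_embedded()
if proc is None:
    print("lemond not found; bundle it alongside your application binary")
```

Your app would then talk to the child process over the same OpenAI-compatible HTTP API as any other client, and terminate it on shutdown.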
## App integrations
Lemonade has a growing marketplace of first-class integrations. Highlights include:
- VS Code GitHub Copilot — use local models for code completions
- Claude Code — `lemonade launch claude` wires it up natively (added in v10.0)
- Open WebUI — a polished ChatGPT-style UI, running locally
- Continue — local AI coding assistant for VS Code and JetBrains
- n8n — automate workflows with local AI nodes
- OpenHands — local AI agent for software engineering tasks
- Dify — LLM app building platform
- AnythingLLM — local knowledge base with RAG
## OmniRouter: the key new API concept in v10.3
OmniRouter is worth calling out separately because it changes how you think about the API surface. Previously you would call separate endpoints for text, image, and speech. With OmniRouter, you interact through a single multi-modal endpoint and Lemonade routes each request to the correct backend engine automatically.
This means you can build agentic pipelines — for example, a loop that generates text, converts it to speech, and produces an image — all through one unified client without managing multiple base URLs or backend configurations.
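The shape of such a pipeline is easy to sketch with stubs. Below, each modality is registered as a "tool" that a simple loop dispatches to; the tool names and loop are purely illustrative (not Lemonade's actual API), and in a real setup each stub body would be one call against the same `/v1` endpoint:

```python
# Toy agentic loop: one dispatcher, each modality registered as a tool.
# With OmniRouter, every tool body would hit the same unified endpoint.
def generate_image(prompt: str) -> str:
    return f"<image: {prompt}>"          # stub for the image backend

def describe_image(image: str) -> str:
    return f"A picture showing {image}"  # stub for the vision backend

TOOLS = {"generate_image": generate_image, "describe_image": describe_image}

def run_plan(plan):
    """Execute (tool, input) steps; None means 'use the previous output'."""
    result = None
    for tool, arg in plan:
        result = TOOLS[tool](arg if arg is not None else result)
    return result

# "Generate an image of a lemon tree and then describe it"
print(run_plan([("generate_image", "a lemon tree"), ("describe_image", None)]))
```

The value of OmniRouter is that the routing table (`TOOLS` here) lives server-side: your client stays a single OpenAI-compatible connection while the server picks the right engine per step.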
## How it compares to Ollama and LM Studio
These tools all solve similar problems. Where Lemonade stands out:
- NPU support — one of the very few tools that accelerates inference on the AMD XDNA2 NPU in Ryzen AI laptops
- True multi-modal in one server — text, images, speech-to-text, and TTS from a single API endpoint
- OmniRouter — automatic multi-modal routing without manual backend wiring
- Embeddable binary — package it inside your own app with no Lemonade branding
- Multiple API standards — OpenAI, Anthropic, and Ollama APIs simultaneously
- AMD-first optimizations — deep ROCm integration and NPU tooling maintained by AMD engineers
If you are on a mainstream NVIDIA GPU and only need text generation, Ollama is slightly simpler to get started with. For AMD hardware, AI PCs with NPUs, or multi-modal workloads, Lemonade covers more ground.
## Quick reference
| Resource | Link |
|---|---|
| GitHub | github.com/lemonade-sdk/lemonade |