ArshTechPro
Lemonade v10.3: Run Local LLMs, Image Gen, and Speech on Your Own GPU for Free

If you are building AI-powered apps and feeling the cost of cloud API bills — or the anxiety of sending user data off-device — Lemonade is worth your time.

Lemonade is an open-source local AI server (3.7k stars, sponsored by AMD) that runs LLMs, image generation, speech-to-text, and text-to-speech entirely on your own hardware. It exposes a standard OpenAI-compatible API, so switching from a cloud provider means changing one URL. The project just shipped v10.3, its biggest release yet.


What is Lemonade?

Lemonade installs as a system service and handles everything for you: it downloads models, selects the right backend, and exposes a unified REST endpoint at http://localhost:13305/v1. Under the hood it wires together proven inference engines:

  • llama.cpp for GGUF LLMs (Vulkan, ROCm, CPU, Metal)
  • OnnxRuntime GenAI / FastFlowLM for NPU-accelerated FLM models
  • whisper.cpp for speech-to-text
  • stable-diffusion.cpp for image generation
  • Kokoro for text-to-speech

Apps like n8n, VS Code GitHub Copilot, Open WebUI, Continue, OpenHands, and Dify already integrate with it out of the box via the standard OpenAI API.
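Because the endpoint speaks the standard OpenAI REST protocol, a quick way to confirm the server is alive is to list its models. Here is a minimal sketch using only the Python standard library — it assumes the standard `/v1/models` route and the default port, and returns an empty list if the server is not running:

```python
import json
import urllib.error
import urllib.request

def list_models(base_url="http://localhost:13305/v1"):
    """Return the ids of models the Lemonade server reports, or [] if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except (urllib.error.URLError, OSError):
        return []

print(list_models())
```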


What is new in v10.3 (Latest Release)

v10.3 is a landmark release with three headline changes.

Desktop app is now 10x smaller. The app migrated from Electron to Tauri, a Rust-based cross-platform framework that uses the system's native webview instead of bundling Chromium. macOS and Windows binaries dropped from ~101-107 MB to ~7-9 MB.

OmniRouter for true omni-modal chat. The new OmniRouter unifies all backends — text, image, speech, vision — into a single OpenAI-compatible endpoint. You can interact with these modalities as tools in an agentic loop, making natural-language requests like "generate an image of X and then describe it" without gluing separate API calls together.

ROCm 7 support with multiple channels. ROCm 7.2 stable, 7.12 preview, and TheRock nightly builds are all supported. The 7.12 preview is now the default.

Other notable changes in v10.3:

  • Light mode theme added to the GUI
  • Easy llama.cpp version pinning and auto-update
  • AppImage removed for Linux; use the web app or Snap instead
  • lemonade-server-minimal.msi deprecated (will be removed in a future release)
  • amd_igpu and amd_dgpu consolidated to amd_gpu in the system-info endpoint
  • nvidia_dgpu renamed to nvidia_gpu for consistency

Recent release history at a glance

| Version | Key headline |
| --- | --- |
| v10.3 (latest) | OmniRouter, Tauri app (10x smaller), ROCm 7 |
| v10.2 | Embeddable Lemonade binary, Qwen Image models, OpenCode integration |
| v10.1 | Gemma 4 on GPU, super-resolution (Real-ESRGAN), new lemonade CLI, default port changed to 13305 |
| v10.0 | Claude Code integration, Fedora RPM installer, NPU on Linux, FLM multi-modal |
| v9.4 | Qwen 3.5 on ROCm/Vulkan, redesigned app, image editing endpoint |

Install

Windows

Download lemonade.msi from the releases page. This installs both the server and the Tauri desktop app.

Ubuntu / Debian (PPA)

sudo add-apt-repository ppa:lemonade-team/stable
sudo apt install lemonade-server

Snap

sudo snap install lemonade-server

macOS (beta)

Download the .pkg from the releases page.

Docker

docker run -p 13305:13305 lemonade-sdk/lemonade-server

RPM (Fedora)

# Download the .rpm from the releases page
sudo rpm -i lemonade-server-10.3.0.x86_64.rpm

Your first model in 3 commands

Note: as of v10.1, the CLI command is lemonade (the old lemonade-server CLI is deprecated).

# See which backends your hardware supports
lemonade recipes

# Download a model
lemonade pull Gemma-3-4b-it-GGUF

# Run it
lemonade run Gemma-3-4b-it-GGUF

Other modalities work the same way:

# Image generation
lemonade run SDXL-Turbo

# Text-to-speech
lemonade run kokoro-v1

# Speech-to-text
lemonade run Whisper-Large-v3-Turbo
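Once a speech model is loaded, text-to-speech follows the shape of the OpenAI audio API. The sketch below uses only the standard library; the `/v1/audio/speech` route and the `voice` id are assumptions based on the OpenAI spec and Kokoro's voice list, and the function returns False rather than crashing if the server is unreachable:

```python
import json
import urllib.error
import urllib.request

def synthesize(text, out_path="speech.wav", base_url="http://localhost:13305/v1"):
    """POST text to the TTS endpoint and save the returned audio. True on success."""
    payload = json.dumps({
        "model": "kokoro-v1",   # model name from the pull example above
        "input": text,
        "voice": "af_heart",    # assumed Kokoro voice id
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            with open(out_path, "wb") as f:
                f.write(resp.read())
        return True
    except (urllib.error.URLError, OSError):
        return False
```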

Integrating with your app

Because Lemonade exposes an OpenAI-compatible API, you swap it in with a single config change. The base URL as of v10.1 is http://localhost:13305/v1.

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/v1",
    api_key="lemonade"  # required by the library, but unused by Lemonade
)

response = client.chat.completions.create(
    model="Gemma-3-4b-it-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of running AI locally."}
    ]
)

print(response.choices[0].message.content)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
    baseURL: "http://localhost:13305/v1",
    apiKey: "lemonade",
});

const response = await client.chat.completions.create({
    model: "Gemma-3-4b-it-GGUF",
    messages: [{ role: "user", content: "Hello from Node.js" }],
});

console.log(response.choices[0].message.content);

Lemonade also supports the Ollama API and the Anthropic API for apps that use those clients natively.


Supported hardware

| Hardware | Backend | Notes |
| --- | --- | --- |
| AMD Radeon RDNA3 / RDNA4 GPU | ROCm | RX 7000/9000 series, Radeon PRO |
| Ryzen AI MAX (Strix Halo) | ROCm (gfx1151) | Windows and Ubuntu |
| Any Vulkan-capable GPU | Vulkan (llama.cpp) | Broad compatibility |
| AMD Ryzen AI NPU (XDNA2) | FLM / FastFlowLM | Windows and Linux (beta) |
| Any x86_64 CPU | CPU | Universally available, slower |
| Apple Silicon | Metal | macOS beta |

Not sure what your machine supports? Run:

lemonade recipes

It auto-detects your hardware and lists exactly which backends are available.


Embeddable Lemonade

Since v10.2, you can bundle Lemonade as a portable binary inside your own application. Your users get local multi-modal AI without ever seeing a Lemonade installer or any Lemonade branding.

# Run lemond as a subprocess from your app
lemond ./

Full guide: lemonade-server.ai/docs/embeddable/

This is particularly useful for desktop app developers who want to ship local AI features without taking on the complexity of packaging inference engines themselves.
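As a sketch of what that looks like from the host application's side, here is one way to manage the embedded server's lifecycle. The binary name and argument simply mirror the snippet above — treat both as assumptions and consult the embeddable guide for the real contract:

```python
import atexit
import subprocess

def start_embedded_server(binary="./lemond", workdir="./"):
    """Launch the bundled Lemonade binary as a child process and make sure
    it is terminated when the host application exits."""
    proc = subprocess.Popen([binary, workdir])
    atexit.register(proc.terminate)  # clean shutdown on host exit
    return proc
```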


App integrations

Lemonade has a growing marketplace of first-class integrations. Highlights include:

  • VS Code GitHub Copilot — use local models for code completions
  • Claude Code — `lemonade launch claude` wires it up natively (added in v10.0)
  • Open WebUI — a polished ChatGPT-style UI, running locally
  • Continue — local AI coding assistant for VS Code and JetBrains
  • n8n — automate workflows with local AI nodes
  • OpenHands — local AI agent for software engineering tasks
  • Dify — LLM app building platform
  • AnythingLLM — local knowledge base with RAG
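For clients that accept an OpenAI-compatible provider config, pointing them at Lemonade is a single stanza. For example, a Continue `config.json` model entry might look like the following — the field names follow Continue's JSON config format, but the exact schema varies across Continue versions, so treat this as a sketch:

```json
{
  "models": [
    {
      "title": "Lemonade (local)",
      "provider": "openai",
      "model": "Gemma-3-4b-it-GGUF",
      "apiBase": "http://localhost:13305/v1",
      "apiKey": "lemonade"
    }
  ]
}
```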

OmniRouter: the key new API concept in v10.3

OmniRouter is worth calling out separately because it changes how you think about the API surface. Previously you would call separate endpoints for text, image, and speech. With OmniRouter, you interact through a single multi-modal endpoint and Lemonade routes each request to the correct backend engine automatically.

This means you can build agentic pipelines — for example, a loop that generates text, converts it to speech, and produces an image — all through one unified client without managing multiple base URLs or backend configurations.
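To make the contrast concrete, here is what that pipeline looks like when you wire the classic endpoints together by hand — exactly the glue OmniRouter removes. This standard-library sketch assumes the standard OpenAI routes and reuses the model names from earlier in this article; it returns an error string if the server is unreachable:

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:13305/v1"

def _post(path, payload):
    """POST a JSON payload to an OpenAI-style route and decode the response."""
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)

def generate_then_describe(prompt):
    """Generate an image, then ask a vision-capable model to describe it."""
    try:
        image = _post("/images/generations", {
            "model": "SDXL-Turbo",
            "prompt": prompt,
            "response_format": "b64_json",
        })
        b64 = image["data"][0]["b64_json"]
        chat = _post("/chat/completions", {
            "model": "Gemma-3-4b-it-GGUF",
            "messages": [{"role": "user", "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        })
        return chat["choices"][0]["message"]["content"]
    except (urllib.error.URLError, OSError) as exc:
        return f"Lemonade server not reachable: {exc}"

print(generate_then_describe("a watercolor lemon on a wooden table"))
```

With OmniRouter, the same request can instead be a single chat call in which image generation is exposed as a tool the model invokes itself.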


How it compares to Ollama and LM Studio

These tools all solve similar problems. Where Lemonade stands out:

  • NPU support — one of the very few tools that accelerates inference on the AMD XDNA2 NPU in Ryzen AI laptops
  • True multi-modal in one server — text, images, speech-to-text, and TTS from a single API endpoint
  • OmniRouter — automatic multi-modal routing without manual backend wiring
  • Embeddable binary — package it inside your own app with no Lemonade branding
  • Multiple API standards — OpenAI, Anthropic, and Ollama APIs simultaneously
  • AMD-first optimizations — deep ROCm integration and NPU tooling maintained by AMD engineers

If you are on a mainstream NVIDIA GPU and only need text generation, Ollama is slightly simpler to get started with. For AMD hardware, AI PCs with NPUs, or multi-modal workloads, Lemonade covers more ground.


Quick reference

| Resource | Link |
| --- | --- |
| GitHub | github.com/lemonade-sdk/lemonade |
