You know the drill. You build an AI-powered feature, ship it to production, and then the bill arrives. Or worse — your users' data is flowing through a third-party API you don't control. Privacy regulations tighten. API costs scale with usage. Latency adds up.
What if you could run the same models locally, on your own hardware, with an API that's drop-in compatible with OpenAI? That's exactly what AMD's Lemonade Server delivers — and it hit 516 points on Hacker News for good reason.
In this tutorial, I'll walk you through setting up Lemonade Server, running your first models, and integrating it into a real application.
Why Local AI Matters Now
Three trends are converging:
Privacy is non-negotiable. GDPR, HIPAA, and internal data policies increasingly require keeping data on-prem. Sending user prompts to OpenAI isn't always an option.
Cloud costs compound. A GPT-4-class API call costs pennies. Millions of calls cost thousands. If you're building internal tools, prototyping, or running batch workloads, those costs scale fast.
Hardware caught up. Modern GPUs and NPUs can run capable models locally. A mid-range machine with 16GB VRAM can handle most text generation tasks. AMD's NPU-equipped chips make it even more accessible.
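To make the cost argument concrete, here is a back-of-envelope sketch. The prices and volumes below are illustrative assumptions, not quotes from any provider:

```python
# Rough monthly spend for a per-token-billed cloud API.
# All numbers here are illustrative assumptions, not real pricing.

def monthly_cloud_cost(calls_per_day: int, tokens_per_call: int,
                       price_per_million_tokens: float) -> float:
    """Estimated monthly spend at a flat per-million-token price."""
    tokens_per_month = calls_per_day * 30 * tokens_per_call
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# Example: an internal tool making 50k calls/day at ~1k tokens each,
# priced at a hypothetical $2.50 per million tokens.
cost = monthly_cloud_cost(50_000, 1_000, 2.50)
print(f"${cost:,.0f}/month")  # $3,750/month
```

At that volume, a one-time hardware spend pays for itself within months. The exact crossover depends on your prices, but the shape of the curve doesn't.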
Lemonade Server sits at the intersection of all three. It's a 2MB native C++ server that auto-configures for your hardware and exposes an OpenAI-compatible API. Let's get it running.
Installation and Setup
Prerequisites
- OS: Windows 10+, Linux (Ubuntu 22.04+), or macOS
- Hardware: Any GPU (AMD Radeon, NVIDIA, or Apple Silicon) or an NPU-equipped AMD Ryzen AI processor
- RAM: 16GB system RAM minimum; 32GB recommended for larger models
- Storage: ~10GB free for models
Step 1: Install Lemonade Server
The quickest way is the one-liner installer:
# Linux/macOS
curl -fsSL https://lemonade-server.ai/install.sh | bash
# Windows (PowerShell)
irm https://lemonade-server.ai/install.ps1 | iex
Or grab the binary directly from GitHub Releases:
# Linux
wget https://github.com/lemonade-sdk/lemonade-server/releases/latest/download/lemonade-server-linux.tar.gz
tar xzf lemonade-server-linux.tar.gz
sudo mv lemonade-server /usr/local/bin/
Verify the installation:
lemonade-server --version
# lemonade-server 0.4.2
Step 2: Start the Server
lemonade-server serve
You'll see output like:
╭─────────────────────────────────────────────╮
│  Lemonade Server v0.4.2                     │
│  API:      http://localhost:8000            │
│  Hardware: AMD Ryzen AI 9 HX 370 (NPU)      │
│  Models:   None loaded                      │
╰─────────────────────────────────────────────╯
Lemonade auto-detects your hardware and configures the optimal backend. If you have an NPU, it'll use that. If you have a GPU, it'll use that. No driver wrangling required.
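Once the server is up, you can verify it's reachable from code before wiring in a full client. This sketch assumes the server exposes the standard OpenAI-compatible `/v1/models` endpoint (part of the API surface it mirrors):

```python
import json
import urllib.request

def models_endpoint(base_url: str) -> str:
    """Build the model-listing URL from the server's base URL."""
    return base_url.rstrip("/") + "/v1/models"

def list_models(base_url: str = "http://localhost:8000") -> list[str]:
    """Return the IDs of models the server reports as available."""
    with urllib.request.urlopen(models_endpoint(base_url), timeout=5) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    print(list_models())
```

An empty list just means no model is loaded yet, which matches the `Models: None loaded` line in the startup banner.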
Step 3: Pull and Run a Model
Lemonade uses a simple pull/run workflow similar to Docker:
# Pull a chat model
lemonade-server pull llama3.2:3b
# Run it
lemonade-server run llama3.2:3b
The model downloads once and stays cached locally. Subsequent runs start in under 2 seconds.
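Since cached models can eat disk fast, a small script to audit the cache is handy. The FAQ below notes models live under `~/.lemonade/models/`; adjust `CACHE_DIR` if your install differs:

```python
from pathlib import Path

# Default cache location per the FAQ; override if your install differs.
CACHE_DIR = Path.home() / ".lemonade" / "models"

def cache_size_bytes(cache_dir: Path = CACHE_DIR) -> int:
    """Total bytes used by cached model files (0 if the dir is absent)."""
    if not cache_dir.exists():
        return 0
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())

if __name__ == "__main__":
    print(f"{cache_size_bytes() / 1e9:.1f} GB cached")
```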
You can also load multiple models simultaneously:
lemonade-server pull phi3:mini
lemonade-server run phi3:mini --port 8001
Each model runs on its own port, so you can serve several models side by side — say, a chat model on 8000 and a second model like Phi-3 Mini on 8001, as above.
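With one model per port, a small registry keeps the base-URL bookkeeping in one place. The port mapping below mirrors the example above; adjust it to match your own setup:

```python
# Map each locally served model to the port it runs on.
# Ports mirror the pull/run example above; adjust to your setup.
MODEL_PORTS = {
    "llama3.2:3b": 8000,
    "phi3:mini": 8001,
}

def base_url_for(model: str, host: str = "localhost") -> str:
    """OpenAI-compatible base URL for the instance serving this model."""
    return f"http://{host}:{MODEL_PORTS[model]}/v1"

# Then construct a client per model, e.g.:
# client = OpenAI(base_url=base_url_for("phi3:mini"), api_key="local")
print(base_url_for("phi3:mini"))  # http://localhost:8001/v1
```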
Using the OpenAI-Compatible API
This is where Lemonade shines. The API is a drop-in replacement for OpenAI's chat completions endpoint:
from openai import OpenAI
# Point to local server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Lemonade doesn't require an API key
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ]
)
print(response.choices[0].message.content)
That's it. If you have existing code using the OpenAI SDK, change the base_url and you're done. No code rewrite. No new SDK to learn.
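One pattern for making that switch without touching code at all is reading the base URL from the environment. Note that `LEMONADE_BASE_URL` is a name invented for this sketch, not an official variable:

```python
import os

# LEMONADE_BASE_URL is a hypothetical variable name for this sketch.
# When set, traffic goes to the local server; otherwise, to OpenAI.

def resolve_base_url(default: str = "https://api.openai.com/v1") -> str:
    """Prefer a locally configured endpoint; fall back to the default."""
    return os.environ.get("LEMONADE_BASE_URL", default)

# client = OpenAI(base_url=resolve_base_url(), api_key="local")
```

Deploys can then flip between local and cloud per environment, with no branch in application code.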
Streaming Responses
stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain quicksort"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
cURL Works Too
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
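When you call the endpoint without the SDK, the raw JSON follows OpenAI's chat-completion schema, so extracting the reply is a couple of lines:

```python
import json

def extract_reply(response_json: str) -> str:
    """Pull the assistant's message text out of a chat-completion response."""
    body = json.loads(response_json)
    return body["choices"][0]["message"]["content"]

# Minimal example of the response shape (full responses carry more fields):
sample = '{"choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]}'
print(extract_reply(sample))  # Hi there!
```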
Real-World Scenarios
Scenario 1: Private Code Review Bot
You want AI-assisted code review but can't send proprietary code to a cloud API:
import os
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
def review_diff(diff_text: str) -> str:
    prompt = f"""Review this code diff and flag potential issues:
- Security vulnerabilities
- Logic errors
- Style problems
Diff:
{diff_text}
"""
    response = client.chat.completions.create(
        model="llama3.2:3b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content

# Use in your CI pipeline
with open("pr_diff.txt") as f:
    review = review_diff(f.read())
print(review)
Your code never leaves the machine. Zero privacy concerns. Zero API costs.
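One caveat: a large PR can overflow a small model's context window. A simple mitigation, sketched here, is reviewing the diff one file at a time, assuming standard unified-diff output where each file's section starts with `diff --git`:

```python
def split_diff_by_file(diff_text: str) -> list[str]:
    """Split a unified diff into one chunk per file.

    Assumes git's unified-diff format, where every file section
    begins with a "diff --git" header line.
    """
    chunks, current = [], []
    for line in diff_text.splitlines():
        if line.startswith("diff --git") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

# for chunk in split_diff_by_file(diff_text):
#     print(review_diff(chunk))
```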
Scenario 2: Batch Processing Without Rate Limits
Need to classify 100,000 support tickets? With a cloud API, you're fighting rate limits and racking up costs:
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
def classify_ticket(text: str) -> str:
    response = client.chat.completions.create(
        model="llama3.2:3b",
        messages=[{
            "role": "user",
            "content": f"Classify this ticket as bug, feature, or question: {text}"
        }],
        temperature=0.0,
        max_tokens=10
    )
    return response.choices[0].message.content.strip()

with open("tickets.jsonl") as f:
    for line in f:
        ticket = json.loads(line)
        category = classify_ticket(ticket["text"])
        ticket["category"] = category
        print(json.dumps(ticket))
No rate limits. No per-token costs. Run it at 3 AM and wake up to classified tickets.
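Long overnight runs benefit from being resumable, so a crash at ticket 60,000 doesn't restart from zero. One way, sketched below, is appending finished records to a checkpoint file and skipping their IDs on restart (the `"id"` field is an assumption about your ticket schema):

```python
import json
from pathlib import Path

# Resumable-batch sketch: assumes each ticket dict carries an "id" field
# and that completed records are appended to a JSONL checkpoint file.

def load_done_ids(checkpoint: Path) -> set:
    """IDs already classified in a previous (possibly interrupted) run."""
    if not checkpoint.exists():
        return set()
    return {json.loads(line)["id"]
            for line in checkpoint.read_text().splitlines() if line}

def pending_tickets(tickets: list, checkpoint: Path) -> list:
    """Tickets not yet present in the checkpoint."""
    done = load_done_ids(checkpoint)
    return [t for t in tickets if t["id"] not in done]
```

After classifying each ticket, append it to the checkpoint file; on the next run, feed `pending_tickets(...)` to the loop instead of the full list.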
Scenario 3: Running Alongside a Cloud Fallback
Use local for speed and cost, fall back to cloud for quality:
from openai import OpenAI
local = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
cloud = OpenAI() # Default OpenAI endpoint
def smart_chat(prompt: str, quality: str = "fast") -> str:
    if quality == "fast":
        client, model = local, "llama3.2:3b"
    else:
        client, model = cloud, "gpt-4o"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
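A variant worth considering: fall back to cloud on *failure* (local server down, model unloaded), not only on a quality flag. Passing the backends as plain callables keeps the policy testable without a live server:

```python
def with_fallback(primary, fallback):
    """Return a function that tries `primary`, then `fallback` on error.

    Both arguments are callables taking a prompt string and returning
    the reply text, e.g. partial applications of smart_chat above.
    """
    def call(prompt: str) -> str:
        try:
            return primary(prompt)
        except Exception:
            return fallback(prompt)
    return call

# chat = with_fallback(
#     lambda p: smart_chat(p, quality="fast"),
#     lambda p: smart_chat(p, quality="best"),
# )
```

In production you'd likely narrow the `except` to connection errors and add logging, but the shape stays the same.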
Supported Models
Lemonade supports a growing library of models across categories:
| Category | Models | Notes |
|---|---|---|
| Chat | Llama 3.2 (1B, 3B), Phi-3 Mini, Mistral 7B | Most popular, good general-purpose |
| Vision | Llama 3.2 Vision (11B, 90B) | Image understanding |
| Code | DeepSeek Coder, CodeLlama | Code generation & review |
| Embeddings | Nomic Embed, All-MiniLM | RAG and search |
| Speech | Whisper | Transcription |
Check the full list:
lemonade-server list
FAQ and Troubleshooting
Q: Do I need an AMD GPU?
No. Lemonade supports AMD Radeon, NVIDIA, Apple Silicon, and AMD NPUs. It auto-detects and uses whatever you have.
Q: How does performance compare to cloud APIs?
For smaller models (3B-7B parameters), local inference on a modern GPU achieves 30-80 tokens/second — comparable to or faster than cloud APIs when you account for network latency. Larger models (70B+) will be slower locally unless you have high-end hardware.
Q: Can I run Lemonade on a machine without a GPU?
Technically yes (CPU fallback exists), but performance will be poor. A cheap used GPU or an NPU-equipped laptop is worth it.
Q: The model download is slow. Can I pre-download?
Yes. Models are stored in ~/.lemonade/models/. You can copy them between machines or use lemonade-server pull on a fast connection and transfer the files.
Q: "Error: No suitable device found"
Make sure your GPU drivers are up to date. On Linux, verify with rocm-smi (AMD) or nvidia-smi (NVIDIA). On Windows, update through Device Manager or your GPU vendor's software.
Q: "Out of memory loading model"
Try a smaller model or reduce the context window:
lemonade-server run llama3.2:1b # Smaller model
lemonade-server run llama3.2:3b --ctx-size 2048 # Smaller context
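To pick a model size that fits your hardware, a rough rule of thumb (a general estimate, not a Lemonade-specific formula) is that the weights alone need `parameters × bytes-per-parameter`, before KV-cache overhead:

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB.

    Rule-of-thumb only: real usage adds KV cache and runtime overhead.
    """
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

# A 3B model at 4-bit quantization: about 1.5 GB of weights.
print(f"{weight_memory_gb(3, 4):.1f} GB")  # 1.5 GB
```

So a 3B model quantized to 4 bits fits comfortably in 8GB of VRAM, while a 7B model at 16-bit precision (about 14 GB of weights) would not.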
Q: Can multiple applications use the same Lemonade instance?
Yes. The server handles multiple concurrent requests. Just point all your apps at http://localhost:8000/v1.
Conclusion
Lemonade Server fills a real gap in the AI tooling landscape. It's not trying to replace GPT-4 for complex reasoning — but for the 80% of AI workloads that are straightforward generation, classification, or extraction, running locally makes more sense than paying per token to a cloud provider.
The 2MB binary, hardware auto-detection, and OpenAI API compatibility mean you can go from zero to a working local AI server in under five minutes. And if you're already using the OpenAI SDK, migration is literally changing one URL.
If privacy, cost, or latency have been holding you back from adding AI to your applications, give Lemonade a try. Your data stays on your machine. Your budget stays in your pocket. Your code doesn't need to change.
Found this useful? Follow for more tutorials on local AI, developer tools, and practical engineering. Lemonade Server is open source under the MIT license — check it out on GitHub.