AttractivePenguin

How to Run AI Locally with Lemonade Server: No Cloud, No API Keys, No Problem

You know the drill. You build an AI-powered feature, ship it to production, and then the bill arrives. Or worse — your users' data is flowing through a third-party API you don't control. Privacy regulations tighten. API costs scale with usage. Latency adds up.

What if you could run the same models locally, on your own hardware, with an API that's drop-in compatible with OpenAI? That's exactly what AMD's Lemonade Server delivers — and it hit 516 points on Hacker News for good reason.

In this tutorial, I'll walk you through setting up Lemonade Server, running your first models, and integrating it into a real application.

Why Local AI Matters Now

Three trends are converging:

  1. Privacy is non-negotiable. GDPR, HIPAA, and internal data policies increasingly require keeping data on-prem. Sending user prompts to OpenAI isn't always an option.

  2. Cloud costs compound. A GPT-4-class API call costs pennies. Millions of calls cost thousands. If you're building internal tools, prototyping, or running batch workloads, those costs scale fast.

  3. Hardware caught up. Modern GPUs and NPUs can run capable models locally. A mid-range machine with 16GB VRAM can handle most text generation tasks. AMD's NPU-equipped chips make it even more accessible.

Lemonade Server sits at the intersection of all three. It's a 2MB native C++ server that auto-configures for your hardware and exposes an OpenAI-compatible API. Let's get it running.

Installation and Setup

Prerequisites

  • OS: Windows 10+, Linux (Ubuntu 22.04+), or macOS
  • Hardware: Any GPU (AMD Radeon, NVIDIA, or Apple Silicon) or an NPU-equipped AMD Ryzen AI processor
  • RAM: 16GB system RAM minimum; 32GB recommended for larger models
  • Storage: ~10GB free for models
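Before installing, a quick pre-flight check confirms your machine meets these requirements. A minimal sketch for Linux (the GPU tools shown are vendor utilities, not part of Lemonade):

```shell
# Free disk space for model storage (~10GB needed)
df -h ~

# Total system RAM (16GB minimum recommended; Linux-specific)
grep MemTotal /proc/meminfo

# GPU visibility: uncomment the tool for your vendor
# nvidia-smi   # NVIDIA
# rocm-smi     # AMD
```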

Step 1: Install Lemonade Server

The quickest way is the one-liner installer:

# Linux/macOS
curl -fsSL https://lemonade-server.ai/install.sh | bash

# Windows (PowerShell)
irm https://lemonade-server.ai/install.ps1 | iex

Or grab the binary directly from GitHub Releases:

# Linux
wget https://github.com/lemonade-sdk/lemonade-server/releases/latest/download/lemonade-server-linux.tar.gz
tar xzf lemonade-server-linux.tar.gz
sudo mv lemonade-server /usr/local/bin/

Verify the installation:

lemonade-server --version
# lemonade-server 0.4.2

Step 2: Start the Server

lemonade-server serve

You'll see output like:

╭─────────────────────────────────────────────╮
│  Lemonade Server v0.4.2                     │
│  API: http://localhost:8000                 │
│  Hardware: AMD Ryzen AI 9 HX 370 (NPU)      │
│  Models: None loaded                        │
╰─────────────────────────────────────────────╯

Lemonade auto-detects your hardware and configures the optimal backend. If you have an NPU, it'll use that. If you have a GPU, it'll use that. No driver wrangling required.

Step 3: Pull and Run a Model

Lemonade uses a simple pull/run workflow similar to Docker:

# Pull a chat model
lemonade-server pull llama3.2:3b

# Run it
lemonade-server run llama3.2:3b

The model downloads once and stays cached locally. Subsequent runs start in under 2 seconds.

You can also load multiple models simultaneously:

lemonade-server pull phi3:mini
lemonade-server run phi3:mini --port 8001

Each model runs on its own port, so you can serve a chat model and an image model at the same time.
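With each model on its own port, a small routing helper keeps client code tidy. A sketch (the port mapping mirrors the commands above; adjust it to however you started your servers):

```python
# Map each model to the port its server listens on.
# These assignments follow the commands above: the default
# port is 8000, and phi3:mini was started with --port 8001.
MODEL_PORTS = {
    "llama3.2:3b": 8000,
    "phi3:mini": 8001,
}

def base_url_for(model: str) -> str:
    """Return the OpenAI-compatible base URL for a given model."""
    return f"http://localhost:{MODEL_PORTS[model]}/v1"

print(base_url_for("phi3:mini"))  # http://localhost:8001/v1
```

Pass the result as the base_url when constructing the client for that model.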

Using the OpenAI-Compatible API

This is where Lemonade shines. The API is a drop-in replacement for OpenAI's chat completions endpoint:

from openai import OpenAI

# Point to local server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Lemonade doesn't require an API key
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ]
)

print(response.choices[0].message.content)

That's it. If you have existing code using the OpenAI SDK, change the base_url and you're done. No code rewrite. No new SDK to learn.
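One way to exploit that compatibility is to pick the endpoint from the environment, so the same code runs against Lemonade in development and OpenAI in production. A sketch (the USE_LOCAL_AI variable is my own convention, not something Lemonade defines):

```python
import os

def client_config() -> dict:
    """Return kwargs for OpenAI(...) based on the environment."""
    if os.environ.get("USE_LOCAL_AI", "1") == "1":
        # Local Lemonade server: no real key required
        return {"base_url": "http://localhost:8000/v1", "api_key": "not-needed"}
    # Cloud OpenAI: a real key must be set in the environment
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ["OPENAI_API_KEY"],
    }

os.environ.setdefault("USE_LOCAL_AI", "1")
print(client_config()["base_url"])  # http://localhost:8000/v1
```

Then `client = OpenAI(**client_config())` works unchanged in either mode.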

Streaming Responses

stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain quicksort"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

cURL Works Too

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Real-World Scenarios

Scenario 1: Private Code Review Bot

You want AI-assisted code review but can't send proprietary code to a cloud API:

import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def review_diff(diff_text: str) -> str:
    # Build the prompt without the stray indentation a triple-quoted
    # string would embed from the surrounding code.
    prompt = (
        "Review this code diff and flag potential issues:\n"
        "- Security vulnerabilities\n"
        "- Logic errors\n"
        "- Style problems\n\n"
        f"Diff:\n{diff_text}"
    )

    response = client.chat.completions.create(
        model="llama3.2:3b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content

# Use in your CI pipeline
with open("pr_diff.txt") as f:
    review = review_diff(f.read())
    print(review)

Your code never leaves the machine. Zero privacy concerns. Zero API costs.

Scenario 2: Batch Processing Without Rate Limits

Need to classify 100,000 support tickets? With a cloud API, you're fighting rate limits and racking up costs:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def classify_ticket(text: str) -> str:
    response = client.chat.completions.create(
        model="llama3.2:3b",
        messages=[{
            "role": "user", 
            "content": f"Classify this ticket as bug, feature, or question: {text}"
        }],
        temperature=0.0,
        max_tokens=10
    )
    return response.choices[0].message.content.strip()

with open("tickets.jsonl") as f:
    for line in f:
        ticket = json.loads(line)
        category = classify_ticket(ticket["text"])
        ticket["category"] = category
        print(json.dumps(ticket))

No rate limits. No per-token costs. Run it at 3 AM and wake up to classified tickets.
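Since there are no rate limits, you can also parallelize the loop to keep the server busy. A sketch using a thread pool (the stub classifier in the demo stands in for the classify_ticket function above):

```python
from concurrent.futures import ThreadPoolExecutor

def classify_batch(tickets, classify, workers=4):
    """Classify a list of ticket dicts concurrently.

    `classify` is any function mapping ticket text to a category
    (e.g. the classify_ticket function above).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        categories = list(pool.map(classify, (t["text"] for t in tickets)))
    for ticket, category in zip(tickets, categories):
        ticket["category"] = category
    return tickets

# Demo with a stub classifier; swap in classify_ticket for real use.
demo = classify_batch([{"text": "app crashes on login"}], lambda text: "bug")
print(demo[0]["category"])  # bug
```

Keep the worker count modest: throughput is ultimately bounded by your GPU or NPU, not by the thread pool.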

Scenario 3: Running Alongside a Cloud Fallback

Use local for speed and cost, fall back to cloud for quality:

from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
cloud = OpenAI()  # Default OpenAI endpoint

def smart_chat(prompt: str, quality: str = "fast") -> str:
    if quality == "fast":
        client, model = local, "llama3.2:3b"
    else:
        client, model = cloud, "gpt-4o"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
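You can push this further by falling back only when the local server actually fails. A sketch; `primary` and `fallback` are any callables wrapping the two clients above:

```python
def with_fallback(primary, fallback, prompt: str) -> str:
    """Try the local model first; use the cloud only if it fails."""
    try:
        return primary(prompt)
    except Exception:
        # Local server down or out of memory: retry against the cloud
        return fallback(prompt)

# Demo with stubs standing in for real client calls:
def failing_local(prompt):
    raise ConnectionError("local server down")

print(with_fallback(failing_local, lambda p: "cloud answer", "hello"))
# cloud answer
```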

Supported Models

Lemonade supports a growing library of models across categories:

  • Chat: Llama 3.2 (1B, 3B), Phi-3 Mini, Mistral 7B. Most popular; good general-purpose.
  • Vision: Llama 3.2 Vision (11B, 90B). Image understanding.
  • Code: DeepSeek Coder, CodeLlama. Code generation and review.
  • Embeddings: Nomic Embed, All-MiniLM. RAG and search.
  • Speech: Whisper. Transcription.
Check the full list:

lemonade-server list

FAQ and Troubleshooting

Q: Do I need an AMD GPU?

No. Lemonade supports AMD Radeon, NVIDIA, Apple Silicon, and AMD NPUs. It auto-detects and uses whatever you have.

Q: How does performance compare to cloud APIs?

For smaller models (3B-7B parameters), local inference on a modern GPU achieves 30-80 tokens/second — comparable to or faster than cloud APIs when you account for network latency. Larger models (70B+) will be slower locally unless you have high-end hardware.
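You can sanity-check those numbers on your own hardware by timing a streamed response. A rough sketch; it treats each streamed chunk as roughly one token, which is close enough for a ballpark figure:

```python
import time

def measure_tps(chunks) -> float:
    """Rough tokens/second over an iterable of streamed text chunks."""
    start = time.perf_counter()
    count = sum(1 for _ in chunks)
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

# Feed it a live stream from the streaming example, e.g.:
# tps = measure_tps(c.choices[0].delta.content or "" for c in stream)
```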

Q: Can I run Lemonade on a machine without a GPU?

Technically yes (CPU fallback exists), but performance will be poor. A cheap used GPU or an NPU-equipped laptop is worth it.

Q: The model download is slow. Can I pre-download?

Yes. Models are stored in ~/.lemonade/models/. You can copy them between machines or use lemonade-server pull on a fast connection and transfer the files.

Q: "Error: No suitable device found"

Make sure your GPU drivers are up to date. On Linux, verify with rocm-smi (AMD) or nvidia-smi (NVIDIA). On Windows, update through Device Manager or your GPU vendor's software.

Q: "Out of memory loading model"

Try a smaller model or reduce the context window:

lemonade-server run llama3.2:1b  # Smaller model
lemonade-server run llama3.2:3b --ctx-size 2048  # Smaller context

Q: Can multiple applications use the same Lemonade instance?

Yes. The server handles multiple concurrent requests. Just point all your apps at http://localhost:8000/v1.

Conclusion

Lemonade Server fills a real gap in the AI tooling landscape. It's not trying to replace GPT-4 for complex reasoning — but for the 80% of AI workloads that are straightforward generation, classification, or extraction, running locally makes more sense than paying per token to a cloud provider.

The 2MB binary, hardware auto-detection, and OpenAI API compatibility mean you can go from zero to a working local AI server in under five minutes. And if you're already using the OpenAI SDK, migration is literally changing one URL.

If privacy, cost, or latency have been holding you back from adding AI to your applications, give Lemonade a try. Your data stays on your machine. Your budget stays in your pocket. Your code doesn't need to change.


Found this useful? Follow for more tutorials on local AI, developer tools, and practical engineering. Lemonade Server is open source under the MIT license — check it out on GitHub.
