Alex Spinov
LM Studio Has a Free API: Run LLMs Locally with OpenAI-Compatible Endpoint

LM Studio lets you run large language models locally on your computer with a beautiful GUI and an OpenAI-compatible API server. No cloud costs, no data leaving your machine, full privacy.

What Is LM Studio?

LM Studio is a desktop application for discovering, downloading, and running local LLMs. It supports GGUF models from Hugging Face and provides an OpenAI-compatible API endpoint, making it a drop-in replacement for OpenAI in your applications.

Key Features:

  • OpenAI-compatible local API server
  • GPU acceleration (CUDA, Metal, Vulkan)
  • GGUF model format support
  • Model discovery from Hugging Face
  • Chat UI with conversation history
  • Multi-model loading
  • Quantization support (Q4, Q5, Q8)

Getting Started

  1. Download LM Studio from lmstudio.ai
  2. Search and download a model (e.g., Llama 3, Mistral, Phi-3)
  3. Load the model and start the local server
  4. Use the API at http://localhost:1234
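Once the server is running, it helps to sanity-check it before wiring up a client. A minimal sketch using only the standard library (the `/v1/models` endpoint and port 1234 are LM Studio's defaults; the helper names are mine):

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]


def list_loaded_models(base: str = "http://localhost:1234") -> list[str]:
    """Ask the local LM Studio server which models are currently loaded."""
    with urllib.request.urlopen(f"{base}/v1/models") as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    print(list_loaded_models())
```

If this prints an empty list, the server is up but no model is loaded yet.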

LM Studio API: OpenAI-Compatible Endpoint

from openai import OpenAI

# Point to local LM Studio server
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # Any string works
)

# Chat completion (same as OpenAI!)
response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers"}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)

Streaming Responses

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Embeddings API

# Generate embeddings locally
embeddings = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input=["Kubernetes orchestrates containers", "Docker packages applications"]
)

for i, emb in enumerate(embeddings.data):
    print(f"Text {i}: {len(emb.embedding)} dimensions")
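The vectors come back as plain Python lists, so you can compare them with ordinary math. A quick sketch of cosine similarity using only the standard library (`cosine` is my helper, not an LM Studio API):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# With the response above:
# v1 = embeddings.data[0].embedding
# v2 = embeddings.data[1].embedding
# print(f"similarity: {cosine(v1, v2):.3f}")
```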

Using with LangChain

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
    model="llama-3.2-3b-instruct"
)

template = PromptTemplate(
    input_variables=["topic"],
    template="Write a technical blog outline about {topic}"
)

# LLMChain is deprecated; compose prompt and model with the pipe operator (LCEL)
chain = template | llm
result = chain.invoke({"topic": "web scraping best practices"})
print(result.content)

REST API Direct Access

# Chat completion
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

# List loaded models
curl http://localhost:1234/v1/models

# Embeddings
curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "Hello world"}'
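The same endpoints can be called from Python without the OpenAI SDK. A standard-library-only sketch that mirrors the curl calls above (the helper names are mine):

```python
import json
import urllib.request


def chat_payload(model: str, user_msg: str) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": user_msg}]}


def post_chat(payload: dict, base: str = "http://localhost:1234") -> dict:
    """POST a chat completion to the local LM Studio server and return the JSON reply."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    reply = post_chat(chat_payload("llama-3.2-3b-instruct", "Hello!"))
    print(reply["choices"][0]["message"]["content"])
```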

LM Studio vs Ollama

Feature           LM Studio            Ollama
GUI               Full desktop app     CLI only
API               OpenAI-compatible    OpenAI-compatible
Model format      GGUF                 GGUF (via Modelfile)
Model discovery   In-app HF browser    ollama.com library
Multi-model       Yes                  Yes
OS                Win/Mac/Linux        Win/Mac/Linux

Need to scrape web data to feed your local LLMs? Check out my web scraping tools on Apify — production-ready actors for Reddit, Google Maps, and more. Questions? Email me at spinov001@gmail.com
