Alex Spinov
LM Studio Has a Free API: Run LLMs Locally with OpenAI-Compatible Endpoint

LM Studio lets you run large language models locally on your computer with a beautiful GUI and an OpenAI-compatible API server. No cloud costs, no data leaving your machine, full privacy.

What Is LM Studio?

LM Studio is a desktop application for discovering, downloading, and running local LLMs. It supports GGUF models from Hugging Face and provides an OpenAI-compatible API endpoint, making it a drop-in replacement for OpenAI in your applications.

Key Features:

  • OpenAI-compatible local API server
  • GPU acceleration (CUDA, Metal, Vulkan)
  • GGUF model format support
  • Model discovery from Hugging Face
  • Chat UI with conversation history
  • Multi-model loading
  • Quantization support (Q4, Q5, Q8)

Getting Started

  1. Download LM Studio from lmstudio.ai
  2. Search and download a model (e.g., Llama 3, Mistral, Phi-3)
  3. Load the model and start the local server
  4. Use the API at http://localhost:1234
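Once the server is running, it helps to sanity-check it before wiring up a client. A minimal sketch using only the standard library (the `/v1/models` endpoint and port 1234 are LM Studio's defaults; the helper names are mine):

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]


def list_loaded_models(base: str = "http://localhost:1234") -> list[str]:
    """Ask the local LM Studio server which models are currently loaded."""
    with urllib.request.urlopen(f"{base}/v1/models") as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    print(list_loaded_models())
```

If this prints an empty list, the server is up but no model is loaded yet.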

LM Studio API: OpenAI-Compatible Endpoint

from openai import OpenAI

# Point to local LM Studio server
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # Any string works
)

# Chat completion (same as OpenAI!)
response = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to find prime numbers"}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)

Streaming Responses

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Explain Docker in simple terms"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Embeddings API

# Generate embeddings locally
embeddings = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input=["Kubernetes orchestrates containers", "Docker packages applications"]
)

for i, emb in enumerate(embeddings.data):
    print(f"Text {i}: {len(emb.embedding)} dimensions")
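The vectors come back as plain Python lists, so you can compare them with ordinary math. A quick sketch of cosine similarity using only the standard library (`cosine` is my helper, not an LM Studio API):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# With the response above:
# v1 = embeddings.data[0].embedding
# v2 = embeddings.data[1].embedding
# print(f"similarity: {cosine(v1, v2):.3f}")
```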

Using with LangChain

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
    model="llama-3.2-3b-instruct"
)

template = PromptTemplate(
    input_variables=["topic"],
    template="Write a technical blog outline about {topic}"
)

# LLMChain is deprecated; compose prompt and model with the pipe operator (LCEL)
chain = template | llm
result = chain.invoke({"topic": "web scraping best practices"})
print(result.content)

REST API Direct Access

# Chat completion
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

# List loaded models
curl http://localhost:1234/v1/models

# Embeddings
curl http://localhost:1234/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "Hello world"}'
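The same endpoints can be called from Python without the OpenAI SDK. A standard-library-only sketch that mirrors the curl calls above (the helper names are mine):

```python
import json
import urllib.request


def chat_payload(model: str, user_msg: str) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": user_msg}]}


def post_chat(payload: dict, base: str = "http://localhost:1234") -> dict:
    """POST a chat completion to the local LM Studio server and return the JSON reply."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    reply = post_chat(chat_payload("llama-3.2-3b-instruct", "Hello!"))
    print(reply["choices"][0]["message"]["content"])
```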

LM Studio vs Ollama

Feature           LM Studio            Ollama
GUI               Full desktop app     CLI only
API               OpenAI-compatible    OpenAI-compatible
Model format      GGUF                 GGUF (via Modelfile)
Model discovery   In-app HF browser    ollama.com library
Multi-model       Yes                  Yes
OS                Win/Mac/Linux        Win/Mac/Linux

Need to scrape web data to feed your local LLMs? Check out my web scraping tools on Apify — production-ready actors for Reddit, Google Maps, and more. Questions? Email me at spinov001@gmail.com
