Pavel Espitia

Posted on Apr 17 • Edited on May 5

How to Run LLMs Locally with Ollama — A Developer's Guide

#ai #tutorial #beginners #typescript

You don't need an API key or a cloud subscription to use LLMs. Ollama lets you run models locally on your machine — completely free, completely private. Here's how to set it up and start building with it.

What is Ollama?

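Ollama is a tool that downloads, manages, and serves LLMs locally. It exposes an OpenAI-compatible API at http://localhost:11434/v1, so most code written for the OpenAI chat completions API works with Ollama once you point its base URL there. Coverage of the OpenAI API is partial, so check any less common parameters you depend on.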

Installation

# Linux / WSL
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Windows
# Download from https://ollama.com/download

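Start the server (skip this if the desktop app or the Linux install script already started it as a background service):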

ollama serve

Pick a Model

# Code-focused (best for dev tools)
ollama pull qwen2.5-coder:7b      # 4.7GB, good balance
ollama pull qwen2.5-coder:1.5b    # 1.0GB, fast, good enough for many tasks
ollama pull deepseek-coder-v2      # 8.9GB, top quality

# General purpose
ollama pull llama3.1:8b            # 4.7GB, Meta's Llama 3.1
ollama pull mistral:7b             # 4.1GB, fast and capable

My recommendation: start with qwen2.5-coder:1.5b for speed, upgrade to 7b when you need quality.
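
Before sending prompts from a script, it can help to confirm that the model you picked has actually been pulled. The sketch below assumes Ollama's native GET /api/tags endpoint, which lists the models stored locally. The helper names isInstalled and listLocalModels are made up for this example.

```typescript
// The part of the /api/tags response this sketch relies on.
interface TagsResponse {
  models: { name: string }[];
}

// Hypothetical helper: true if `wanted` matches a locally stored model.
// A bare name such as "mistral" is treated as "mistral:latest".
function isInstalled(tags: TagsResponse, wanted: string): boolean {
  const full = wanted.includes(":") ? wanted : `${wanted}:latest`;
  return tags.models.some((m) => m.name === full);
}

// Hypothetical helper: ask a running Ollama server which models it has.
async function listLocalModels(
  baseUrl = "http://localhost:11434",
): Promise<TagsResponse> {
  const response = await fetch(`${baseUrl}/api/tags`);
  if (!response.ok) {
    throw new Error(`Ollama returned HTTP ${response.status}`);
  }
  return (await response.json()) as TagsResponse;
}

// Usage (needs a running server):
// const tags = await listLocalModels();
// if (!isInstalled(tags, "qwen2.5-coder:1.5b")) {
//   console.log("Run: ollama pull qwen2.5-coder:1.5b");
// }
```

Failing early with a clear "pull this model first" message tends to be friendlier than a generic error from the first chat request.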

Your First API Call

Ollama serves an OpenAI-compatible endpoint. Here's a call with plain fetch:

const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen2.5-coder:7b",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Explain what a closure is in JavaScript." },
    ],
    temperature: 0,
    stream: false,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);

That's it. No API key, no SDK, no account.

Structured Output (JSON Mode)

The key to building real tools with LLMs is getting structured output. Tell the model to respond with JSON:

const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen2.5-coder:7b",
    messages: [
      {
        role: "system",
        content: `Respond with ONLY valid JSON matching this schema:
        { "summary": "string", "topics": ["string"], "difficulty": "beginner|intermediate|advanced" }`,
      },
      {
        role: "user",
        content: "Analyze this article topic: Building REST APIs with Express.js",
      },
    ],
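    temperature: 0,
    stream: false,
    // JSON mode: asks the server to constrain the reply to valid JSON syntax
    response_format: { type: "json_object" },
  }),
});

const data = await response.json();
// The JSON arrives as a string inside the message content, so parse it
const analysis = JSON.parse(data.choices[0].message.content);
console.log(analysis.topics);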

Tip: always validate the response with Zod or a similar schema validator. Smaller models sometimes return invalid JSON, or valid JSON that doesn't match the schema you asked for.
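
As a rough illustration of that validation step, here is a dependency-free sketch that checks the schema from the prompt above by hand. It stands in for Zod rather than showing Zod's API, and the names Analysis, isAnalysis, and parseAnalysis are invented for this example.

```typescript
type Difficulty = "beginner" | "intermediate" | "advanced";

interface Analysis {
  summary: string;
  topics: string[];
  difficulty: Difficulty;
}

const DIFFICULTIES: readonly string[] = ["beginner", "intermediate", "advanced"];

// Type guard: checks that an unknown value has the fields the prompt asked for.
// Extra fields are tolerated; missing or mistyped ones are not.
function isAnalysis(value: unknown): value is Analysis {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.summary === "string" &&
    Array.isArray(v.topics) &&
    v.topics.every((t) => typeof t === "string") &&
    typeof v.difficulty === "string" &&
    DIFFICULTIES.includes(v.difficulty)
  );
}

// Parses the model's reply. Returns null on malformed JSON or a wrong shape
// instead of throwing, so the caller can retry or fall back.
function parseAnalysis(raw: string): Analysis | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isAnalysis(parsed) ? parsed : null;
  } catch {
    return null;
  }
}

// Usage with the response from the request above:
// const result = parseAnalysis(data.choices[0].message.content);
// if (result === null) { /* retry, or report the bad output */ }
```

Returning null keeps the failure path explicit. With a small model, retrying the request once or twice on null is often enough.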

Building a Provider Abstraction

If you want your app to work with both Ollama (local) and Claude/OpenAI (cloud), create a simple interface:

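// Conversation turns; the system prompt is passed separately
type Message = { role: "user" | "assistant"; content: string };

interface LlmProvider {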
  chat(system: string, messages: Message[]): Promise<string>;
}

class OllamaProvider implements LlmProvider {
  constructor(private model: string) {}

  async chat(system: string, messages: Message[]): Promise<string> {
    const response = await fetch("http://localhost:11434/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: this.model,
        messages: [{ role: "system", content: system }, ...messages],
        temperature: 0,
        stream: false,
      }),
    });
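    // Surface HTTP errors (e.g. model not pulled) instead of failing on a missing field
    if (!response.ok) {
      throw new Error(`Ollama request failed: HTTP ${response.status}`);
    }
    const data = await response.json();
    return data.choices[0].message.content;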
  }
}

Now your code doesn't care where the model runs. Any other class that implements LlmProvider, such as an AnthropicProvider, can be swapped in behind a flag.
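
One way to wire up such a flag is a "provider:model" string, like the ollama:qwen2.5-coder:1.5b value shown later in this post. The parser below is only a sketch of that idea: the "anthropic" prefix, the ModelSpec type, and parseModelSpec are assumptions made for illustration, not the actual implementation of any tool.

```typescript
type ProviderName = "ollama" | "anthropic";

interface ModelSpec {
  provider: ProviderName;
  model: string;
}

// Splits "provider:model" on the FIRST colon only, because Ollama model
// names contain a colon of their own (e.g. "qwen2.5-coder:1.5b").
function parseModelSpec(spec: string): ModelSpec {
  const idx = spec.indexOf(":");
  if (idx === -1) {
    throw new Error(`Expected "provider:model", got "${spec}"`);
  }
  const provider = spec.slice(0, idx);
  const model = spec.slice(idx + 1);
  if (model === "") {
    throw new Error(`Missing model name in "${spec}"`);
  }
  if (provider === "ollama" || provider === "anthropic") {
    return { provider, model };
  }
  throw new Error(`Unknown provider "${provider}"`);
}

// Usage, assuming both provider classes implement LlmProvider:
// const spec = parseModelSpec("ollama:qwen2.5-coder:1.5b");
// const llm = spec.provider === "ollama"
//   ? new OllamaProvider(spec.model)
//   : new AnthropicProvider(spec.model);
```

Keeping the parsing in one small function means the rest of the app only ever sees an LlmProvider and never inspects the flag itself.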

Performance Tips

First call is slow — the model loads into memory. Subsequent calls are fast.
Keep the server running — don't start/stop per request.
Use smaller models for dev — 1.5b for iteration, 7b for production quality.
Set temperature: 0 for near-deterministic output (important for structured responses). Treat it as "much more repeatable", not as a guarantee of identical text on every run.
Add a timeout — local models on CPU can take minutes for long prompts.
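
The timeout tip can be sketched with the standard AbortSignal.timeout API, which is available in modern browsers and recent Node.js versions. The function names and the two-minute default below are illustrative choices, not values recommended by Ollama.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Only the part of the response this sketch reads.
interface ChatCompletion {
  choices: { message: { content: string } }[];
}

const CHAT_URL = "http://localhost:11434/v1/chat/completions";

// Builds fetch options for a non-streaming chat request that is aborted
// automatically once timeoutMs has elapsed.
function buildChatRequest(model: string, messages: ChatMessage[], timeoutMs: number) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, temperature: 0, stream: false }),
    signal: AbortSignal.timeout(timeoutMs),
  };
}

// Sends the request and turns a timeout into a readable error message.
async function chatWithTimeout(
  model: string,
  messages: ChatMessage[],
  timeoutMs = 120_000, // illustrative default; tune for your hardware
): Promise<string> {
  try {
    const response = await fetch(CHAT_URL, buildChatRequest(model, messages, timeoutMs));
    if (!response.ok) {
      throw new Error(`Ollama returned HTTP ${response.status}`);
    }
    const data = (await response.json()) as ChatCompletion;
    return data.choices[0].message.content;
  } catch (err) {
    // AbortSignal.timeout aborts with an error named "TimeoutError"
    if ((err as { name?: string } | null)?.name === "TimeoutError") {
      throw new Error(`No response from ${model} within ${timeoutMs} ms`);
    }
    throw err;
  }
}
```

Because the first call also pays the model-loading cost, a generous timeout on that call and a tighter one afterwards is a reasonable pattern.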

When to Use Local vs Cloud

Use Case	Local (Ollama)	Cloud (Claude/GPT)
Development	Great	Expensive
Privacy-sensitive data	Data stays on your machine	Data is sent to a third party
Production quality	Good (7b+)	Best
Speed	Depends on hardware	Fast
Cost	Free	Per-token

What I Built With It

spectr-ai — an AI smart contract auditor that works with both Claude and Ollama. The --model ollama:qwen2.5-coder:1.5b flag runs everything locally, free, no API key.

Local LLMs are good enough for real developer tools. The quality gap is closing fast.

