
Alex Spinov

Ollama Has a Free API — Run LLMs Locally with One Command

TL;DR

Ollama lets you run large language models locally on your machine. One command to download and run Llama 3, Mistral, Gemma, Phi, and 100+ models — with an OpenAI-compatible API.

What Is Ollama?

Ollama makes local AI simple:

  • One command — ollama run llama3 and you're chatting
  • 100+ models — Llama 3, Mistral, Gemma, Phi, CodeLlama, etc.
  • OpenAI-compatible API — drop-in replacement at localhost:11434
  • GPU acceleration — NVIDIA, AMD, Apple Silicon
  • Model customization — Modelfiles for custom system prompts
  • Free — MIT license, runs on your hardware

Quick Start

# Install
curl -fsSL https://ollama.com/install.sh | sh
# Or: brew install ollama

# Run a model (auto-downloads)
ollama run llama3.1

# Run smaller models for faster responses
ollama run phi3        # 3.8B — fast, good for coding
ollama run mistral     # 7B — great general purpose
ollama run gemma2      # 9B — Google's model
ollama run codellama   # For code generation

REST API

# Chat completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'

# Generate (simple completion)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about programming",
  "stream": false
}'

# List local models
curl http://localhost:11434/api/tags
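If you'd rather script these endpoints than call them with curl, here's a minimal Python sketch using only the standard library. The helper names (`build_chat_payload`, `ollama_chat`) are my own, and it assumes Ollama is listening on its default port:

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "llama3.1",
                       stream: bool = False) -> dict:
    """Build the JSON body for POST /api/chat (single user turn)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def ollama_chat(prompt: str, host: str = "http://localhost:11434",
                **kwargs) -> str:
    """POST to /api/chat and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the reply under "message".
        return json.loads(resp.read())["message"]["content"]
```

With a model pulled and the server running, `ollama_chat("Why is the sky blue?")` returns the reply as a plain string.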

OpenAI-Compatible API

import OpenAI from "openai";

// Point to Ollama instead of OpenAI
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required but unused
});

const response = await client.chat.completions.create({
  model: "llama3.1",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a Python function to sort a list" },
  ],
});

console.log(response.choices[0].message.content);

Python

import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain recursion simply"}],
)
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
):
    print(chunk["message"]["content"], end="")

Custom Models (Modelfile)

# Modelfile
FROM llama3.1

SYSTEM You are a senior Python developer. You write clean, efficient code with type hints. Always include docstrings and tests.

PARAMETER temperature 0.3
PARAMETER top_p 0.9
ollama create python-expert -f Modelfile
ollama run python-expert
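If you generate Modelfiles from templates (say, one per persona), a small sketch; the `render_modelfile` helper is illustrative, not part of Ollama's API:

```python
def render_modelfile(base: str, system: str, **params) -> str:
    """Render a minimal Modelfile: a FROM line, a SYSTEM prompt,
    then one PARAMETER line per keyword argument."""
    lines = [f"FROM {base}", f"SYSTEM {system}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines)

# Write the result to a file named Modelfile, then:
#   ollama create python-expert -f Modelfile
text = render_modelfile(
    "llama3.1",
    "You are a senior Python developer. Always include docstrings and tests.",
    temperature=0.3,
    top_p=0.9,
)
print(text)
```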

Model Recommendations

| Model | Size | RAM needed | Best for |
|-------|------|------------|----------|
| phi3 | 3.8B | 4 GB | Fast responses, coding |
| mistral | 7B | 8 GB | General purpose |
| llama3.1 | 8B | 8 GB | Best open model |
| gemma2 | 9B | 8 GB | Instruction following |
| codellama | 13B | 16 GB | Code generation |
| llama3.1:70b | 70B | 48 GB | Near GPT-4 quality |
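As a toy illustration of the table, here's a hypothetical helper (model names and RAM figures come from the rows above; the function itself is mine, not part of Ollama) that picks the largest listed model fitting a given amount of RAM:

```python
# Models from the table above, ordered smallest to largest:
# (name, approximate RAM needed in GB)
MODELS = [
    ("phi3", 4),
    ("mistral", 8),
    ("llama3.1", 8),
    ("gemma2", 8),
    ("codellama", 16),
    ("llama3.1:70b", 48),
]

def pick_model(ram_gb: int) -> str:
    """Return the largest listed model that fits in ram_gb of memory."""
    fitting = [name for name, need in MODELS if need <= ram_gb]
    if not fitting:
        raise ValueError(f"No listed model fits in {ram_gb} GB")
    return fitting[-1]

print(pick_model(8))   # → gemma2
print(pick_model(16))  # → codellama
```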

Ollama vs Alternatives

| Feature | Ollama | LM Studio | GPT4All | llama.cpp |
|---------|--------|-----------|---------|-----------|
| Setup | 1 command | GUI install | GUI install | Compile |
| API | REST + OpenAI compat | OpenAI compat | API | None |
| Models | 100+ (auto-download) | HuggingFace | Curated | Manual GGUF |
| GPU support | NVIDIA/AMD/Apple | NVIDIA/Apple | NVIDIA/Apple | All |
| Docker | Official image | No | No | Community |
| CLI | Excellent | No | No | Yes |

Running AI locally on scraped data? My Apify tools extract web data — process it locally with Ollama for private, cost-free AI analysis. Questions? Email spinov001@gmail.com
