
Alex Spinov

Ollama Has a Free API — Run LLMs Locally with One Command

TL;DR

Ollama lets you run large language models locally on your machine. One command to download and run Llama 3, Mistral, Gemma, Phi, and 100+ models — with an OpenAI-compatible API.

What Is Ollama?

Ollama makes local AI simple:

  • One command — ollama run llama3 and you're chatting
  • 100+ models — Llama 3, Mistral, Gemma, Phi, CodeLlama, etc.
  • OpenAI-compatible API — drop-in replacement at localhost:11434
  • GPU acceleration — NVIDIA, AMD, Apple Silicon
  • Model customization — Modelfiles for custom system prompts
  • Free — MIT license, runs on your hardware

Quick Start

# Install
curl -fsSL https://ollama.com/install.sh | sh
# Or: brew install ollama

# Run a model (auto-downloads)
ollama run llama3.1

# Run smaller models for faster responses
ollama run phi3        # 3.8B — fast, good for coding
ollama run mistral     # 7B — great general purpose
ollama run gemma2      # 9B — Google's model
ollama run codellama   # For code generation

REST API

# Chat completion
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'

# Generate (simple completion)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about programming",
  "stream": false
}'

# List local models
curl http://localhost:11434/api/tags
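If you'd rather script these endpoints than call them with curl, here's a minimal Python sketch using only the standard library. The helper names (`build_chat_payload`, `ollama_chat`) are my own, and it assumes Ollama is listening on its default port:

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "llama3.1",
                       stream: bool = False) -> dict:
    """Build the JSON body for POST /api/chat (single user turn)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def ollama_chat(prompt: str, host: str = "http://localhost:11434",
                **kwargs) -> str:
    """POST to /api/chat and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the reply under "message".
        return json.loads(resp.read())["message"]["content"]
```

With a model pulled and the server running, `ollama_chat("Why is the sky blue?")` returns the reply as a plain string.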

OpenAI-Compatible API

import OpenAI from "openai";

// Point to Ollama instead of OpenAI
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required but unused
});

const response = await client.chat.completions.create({
  model: "llama3.1",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a Python function to sort a list" },
  ],
});

console.log(response.choices[0].message.content);

Python

import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain recursion simply"}],
)
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
):
    print(chunk["message"]["content"], end="")

Custom Models (Modelfile)

# Modelfile
FROM llama3.1

SYSTEM You are a senior Python developer. You write clean, efficient code with type hints. Always include docstrings and tests.

PARAMETER temperature 0.3
PARAMETER top_p 0.9
ollama create python-expert -f Modelfile
ollama run python-expert
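If you generate Modelfiles from templates (say, one per persona), a small sketch; the `render_modelfile` helper is illustrative, not part of Ollama's API:

```python
def render_modelfile(base: str, system: str, **params) -> str:
    """Render a minimal Modelfile: a FROM line, a SYSTEM prompt,
    then one PARAMETER line per keyword argument."""
    lines = [f"FROM {base}", f"SYSTEM {system}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines)

# Write the result to a file named Modelfile, then:
#   ollama create python-expert -f Modelfile
text = render_modelfile(
    "llama3.1",
    "You are a senior Python developer. Always include docstrings and tests.",
    temperature=0.3,
    top_p=0.9,
)
print(text)
```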

Model Recommendations

| Model | Size | RAM needed | Best for |
|-------|------|------------|----------|
| phi3 | 3.8B | 4 GB | Fast responses, coding |
| mistral | 7B | 8 GB | General purpose |
| llama3.1 | 8B | 8 GB | Best open model |
| gemma2 | 9B | 8 GB | Instruction following |
| codellama | 13B | 16 GB | Code generation |
| llama3.1:70b | 70B | 48 GB | Near GPT-4 quality |
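As a toy illustration of the table, here's a hypothetical helper (model names and RAM figures come from the rows above; the function itself is mine, not part of Ollama) that picks the largest listed model fitting a given amount of RAM:

```python
# Models from the table above, ordered smallest to largest:
# (name, approximate RAM needed in GB)
MODELS = [
    ("phi3", 4),
    ("mistral", 8),
    ("llama3.1", 8),
    ("gemma2", 8),
    ("codellama", 16),
    ("llama3.1:70b", 48),
]

def pick_model(ram_gb: int) -> str:
    """Return the largest listed model that fits in ram_gb of memory."""
    fitting = [name for name, need in MODELS if need <= ram_gb]
    if not fitting:
        raise ValueError(f"No listed model fits in {ram_gb} GB")
    return fitting[-1]

print(pick_model(8))   # → gemma2
print(pick_model(16))  # → codellama
```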

Ollama vs Alternatives

| Feature | Ollama | LM Studio | GPT4All | llama.cpp |
|---------|--------|-----------|---------|-----------|
| Setup | 1 command | GUI install | GUI install | Compile |
| API | REST + OpenAI compat | OpenAI compat | API | None |
| Models | 100+ (auto-download) | HuggingFace | Curated | Manual GGUF |
| GPU support | NVIDIA/AMD/Apple | NVIDIA/Apple | NVIDIA/Apple | All |
| Docker | Official image | No | No | Community |
| CLI | Excellent | No | No | Yes |

Running AI locally on scraped data? My Apify tools extract web data — process it locally with Ollama for private, cost-free AI analysis. Questions? Email spinov001@gmail.com
