OpenAI charges per token. Anthropic charges per token. Ollama lets you run Llama 3, Mistral, Gemma, and other LLMs on YOUR machine — completely free, offline, no API keys needed.
## What Ollama Gives You for Free

- Run LLMs locally — Llama 3, Mistral, Gemma, Phi, Code Llama
- One-command install — `ollama run llama3` and it works
- OpenAI-compatible API — swap OpenAI for Ollama in existing code
- GPU acceleration — NVIDIA, AMD, Apple Silicon
- No internet needed — fully offline after download
- Custom models — import GGUF, create Modelfiles
- Multi-model — run multiple models simultaneously
## Quick Start

```bash
# Install (macOS)
brew install ollama

# Or Linux
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically)
ollama run llama3

# Chat!
>>> What is the capital of France?
The capital of France is Paris.
```
## Available Models

```bash
# Large models (for powerful machines)
ollama run llama3.1:70b       # 70B params, needs 40GB+ RAM
ollama run mixtral            # MoE, great quality
ollama run command-r-plus     # 104B, excellent reasoning

# Medium models (good balance)
ollama run llama3.1           # 8B, great all-rounder
ollama run mistral            # 7B, fast and capable
ollama run gemma2             # 9B, Google's best small model

# Small/fast models
ollama run phi3               # 3.8B, surprisingly capable
ollama run gemma2:2b          # 2B, runs on anything

# Code models
ollama run codellama          # Code generation
ollama run deepseek-coder-v2  # Strong open-source code model
ollama run starcoder2         # Fast code completion

# Embedding models (pulled, not chatted with — use via the API)
ollama pull nomic-embed-text  # For RAG applications
```
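The RAM notes above follow a simple rule of thumb: Ollama's default downloads are ~4-bit quantized, so weights take roughly half a byte per parameter, plus overhead for the KV cache and runtime. Here is a back-of-the-envelope sketch — the 4.5 bits/weight and 1.5 GB overhead figures are assumptions for estimation, not Ollama internals:

```python
def estimated_ram_gb(params_billions: float,
                     bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for a quantized model.

    Assumes ~4.5 bits/weight (typical of 4-bit quantization formats)
    plus a fixed allowance for KV cache and runtime. A guide only.
    """
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb + overhead_gb, 1)

print(estimated_ram_gb(8))   # → 6.0  (an 8B model fits comfortably in 8GB)
print(estimated_ram_gb(70))  # → 40.9 (matches the "40GB+" note above)
```

Actual usage varies with context length and quantization level, so treat the output as a lower bound.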
## OpenAI-Compatible API

```javascript
// Your existing OpenAI code works with ONE change:
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1', // ← just change this
  apiKey: 'ollama', // ← required but unused
});

const response = await client.chat.completions.create({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```
## REST API

```bash
# Generate
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

# Embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox"
}'
```
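With `"stream": true` (the default), `/api/generate` returns newline-delimited JSON: one object per line, each carrying a `"response"` fragment, ending with an object where `"done"` is true. A minimal sketch of reassembling the full reply from captured lines — the sample fragments below are illustrative, not real model output:

```python
import json

def join_stream(ndjson_lines):
    """Reassemble a full reply from Ollama's streaming /api/generate
    output: concatenate each line's "response" field until "done"."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue  # tolerate blank lines between objects
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            break
    return "".join(parts)

sample = [
    '{"response": "The sky ", "done": false}',
    '{"response": "is blue.", "done": false}',
    '{"response": "", "done": true}',
]
print(join_stream(sample))  # → The sky is blue.
```

(Note that `/api/chat` streams its fragments under `"message"` rather than `"response"`, so the field lookup would differ there.)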
## Custom Models (Modelfile)

```dockerfile
# Modelfile
FROM llama3.1
SYSTEM "You are a senior software engineer. Give concise, practical answers with code examples."
PARAMETER temperature 0.3
PARAMETER top_p 0.9
```

```bash
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
```
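Since a Modelfile is plain text, you can also generate one programmatically, e.g. when templating several assistants for a team. A small sketch — `render_modelfile` is a hypothetical helper, not part of Ollama:

```python
def render_modelfile(base_model: str, system_prompt: str, **parameters) -> str:
    """Build Modelfile text from its parts: FROM, SYSTEM, then one
    PARAMETER line per keyword argument (a convenience sketch only)."""
    lines = [f"FROM {base_model}", f'SYSTEM "{system_prompt}"']
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    return "\n".join(lines) + "\n"

text = render_modelfile(
    "llama3.1",
    "You are a senior software engineer.",
    temperature=0.3,
    top_p=0.9,
)
print(text)
```

Write the result to a file named `Modelfile` and feed it to `ollama create` as shown above.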
## Use With LangChain

```python
# pip install langchain-community
from langchain_community.llms import Ollama

llm = Ollama(model="llama3.1")
result = llm.invoke("Write a Python function to sort a list")
print(result)
```
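For the RAG use case mentioned with `nomic-embed-text`, the model returns plain vectors, so ranking documents only needs cosine similarity. A dependency-free sketch, with toy vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# toy 2-d vectors in place of real 768-d embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

In a real pipeline you would embed each document once, store the vectors, embed the query at request time, and return the highest-scoring documents.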
## Performance by Hardware
| Hardware | Llama 3 8B | Mistral 7B | Phi-3 3.8B |
|---|---|---|---|
| M1 Mac (8GB) | 15 tok/s | 18 tok/s | 30 tok/s |
| M2 Pro (16GB) | 35 tok/s | 40 tok/s | 60 tok/s |
| RTX 3080 | 50 tok/s | 55 tok/s | 80 tok/s |
| RTX 4090 | 100 tok/s | 110 tok/s | 150 tok/s |
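To translate tokens-per-second into perceived latency, divide the expected reply length by the throughput. This ignores prompt-processing time, which adds to the total, so treat it as a floor:

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Approximate wall-clock time to generate n_tokens at a steady
    rate (excludes prompt processing, so real time is a bit longer)."""
    return n_tokens / tokens_per_second

# a 500-token answer on an M1 Mac at ~15 tok/s
print(round(generation_seconds(500, 15), 1))  # → 33.3
# the same answer on an RTX 4090 at ~100 tok/s
print(generation_seconds(500, 100))           # → 5.0
```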
## Ollama vs OpenAI vs Anthropic
| Aspect | Ollama | OpenAI | Anthropic |
|---|---|---|---|
| Cost | Free | $2-60/1M tokens | $3-75/1M tokens |
| Privacy | 100% local | Cloud | Cloud |
| Internet | Not needed | Required | Required |
| Quality (best) | Good (70B) | Excellent | Excellent |
| Speed | Hardware-dependent | Fast | Fast |
| Customization | Full (Modelfile) | Fine-tuning ($) | None |
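The cost row is easy to make concrete: cloud pricing is linear in tokens, while Ollama's marginal cost is zero once you own the hardware. A rough sketch — the 50M tokens/month volume and $5/1M rate are hypothetical figures, not a quote:

```python
def monthly_api_cost(tokens_per_month: int, usd_per_million: float) -> float:
    """Cloud API cost at a flat per-token rate (hypothetical usage)."""
    return tokens_per_month / 1_000_000 * usd_per_million

# e.g. 50M tokens/month at $5 per 1M tokens
print(monthly_api_cost(50_000_000, 5.0))  # → 250.0 USD/month; Ollama: $0
```

At steady volume, a few months of that bill can pay for a GPU that runs a 7B-9B model locally.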
## The Verdict

Ollama makes running LLMs locally as easy as `ollama run llama3`. Free, private, offline-capable, and compatible with the OpenAI API. For development, testing, and privacy-sensitive applications, Ollama is essential.
Need help building AI-powered data pipelines or web scrapers? I build custom solutions. Reach out: spinov001@gmail.com
Check out my awesome-web-scraping collection — 400+ tools for extracting web data.