What is Ollama?
Ollama is a free, open-source tool that lets you run powerful AI models on your own computer. No cloud accounts, no API keys, no per-token fees — just download a model and start chatting. It supports models like Llama 3, DeepSeek, Qwen, Gemma, Mistral, and dozens more.
Whether you want complete data privacy, zero-cost AI inference, or the ability to work offline, Ollama makes local AI deployment as simple as a single command.
Installation
macOS
brew install ollama
Or download from ollama.com/download. Requires macOS 11+.
Windows
winget install Ollama.Ollama
Or download the installer from ollama.com/download. Requires Windows 10+.
Linux
curl -fsSL https://ollama.com/install.sh | sh
Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Your First Model in 30 Seconds
After installation, running a model is just one command:
ollama run llama3.1
Ollama automatically downloads the model (about 4.7 GB for Llama 3.1 8B) and starts an interactive chat. That’s it — you’re running a large language model locally.
Popular Models
| Model | Sizes | Best For | Download Size (Q4) |
|---|---|---|---|
| Llama 3.1 (Meta) | 8B, 70B, 405B | General purpose | ~4.7 GB (8B) |
| DeepSeek-R1 | 1.5B – 671B | Reasoning, math, coding | ~4.7 GB (7B) |
| Qwen 3 (Alibaba) | 0.6B – 235B | Multilingual, 128K context | ~5.2 GB (8B) |
| Gemma 3 (Google) | 1B – 27B | Balanced performance | ~2.5 GB (4B) |
| Mistral | 7B | General purpose | ~4.1 GB |
| Phi-3 (Microsoft) | 3.8B, 14B | Lightweight, coding | ~2.2 GB (3.8B) |
| Gemma 4 (Google) | 2B – 31B | Latest gen, MoE | ~1.5 GB (2B) |
Essential Commands
# Download a model
ollama pull llama3.1
# Start chatting
ollama run llama3.1
# One-shot prompt (no interactive mode)
ollama run llama3.1 "Explain Docker in one paragraph"
# List downloaded models
ollama list
# Show running models and memory usage
ollama ps
# Delete a model
ollama rm llama3.1
Hardware Requirements
The model size determines how much RAM or VRAM you need:
| Model Size | Minimum RAM | Recommended RAM | GPU VRAM (Q4) |
|---|---|---|---|
| 1B – 3B | 4 GB | 8 GB | 2-4 GB |
| 7B – 8B | 8 GB | 16 GB | 5-6 GB |
| 13B – 14B | 16 GB | 32 GB | 8-10 GB |
| 27B – 32B | 32 GB | 48 GB | 16-20 GB |
| 70B | 64 GB | 96 GB | 38-45 GB |
Key points:
- Ollama runs on CPU if no GPU is detected — just 5-10x slower
- Apple Silicon Macs use unified memory — a 16 GB M1/M2 runs 7B-8B models well
- NVIDIA GPUs with 6+ GB VRAM get significant speedups
- AMD GPUs are supported via ROCm; experimental Vulkan support extends coverage to more AMD and Intel GPUs
Using the API (Python)
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1, alongside its native REST API at http://localhost:11434/api. Most code written for the OpenAI SDK works with Ollama by simply changing the base URL.
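If you want to see what the SDKs wrap, you can call the native endpoint directly. A minimal sketch using the requests package (any HTTP client works):
import requests

# native chat endpoint; "stream": False returns one complete JSON response
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])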
With the OpenAI Python SDK
pip install openai
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required by SDK but not used
)
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
]
)
print(response.choices[0].message.content)
Streaming
stream = client.chat.completions.create(
model="llama3.1",
messages=[{"role": "user", "content": "Write a short poem about coding"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
With the Native Ollama Library
pip install ollama
import ollama
response = ollama.chat(
model="llama3.1",
messages=[{"role": "user", "content": "Explain Docker in one paragraph"}]
)
print(response["message"]["content"])
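The native library supports streaming too. A minimal sketch mirroring the OpenAI-style streaming example above:
import ollama

# stream=True yields partial responses as they are generated
stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a short poem about coding"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)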
Connect Ollama to OpenClaw (Free Local AI Agent)
Combine Ollama with OpenClaw to build a fully local, zero-cost AI agent that can browse the web, manage files, and chat with you on WhatsApp or Telegram — all running on your own hardware with complete data privacy.
Setup
# 1. Make sure Ollama is running with a model
ollama pull llama3.1
# 2. Install and configure OpenClaw
npm install -g openclaw@latest
openclaw onboard
# Select "Ollama" from the provider list
OpenClaw auto-discovers your local models and connects to them. You get a fully functional AI agent — shell commands, file operations, browser control, scheduled tasks — all powered by your local model.
Performance Tips
Quantization: The Key to Running Big Models
Quantization compresses models to use less memory with minimal quality loss:
| Level | Quality Loss | VRAM per 1B Params | When to Use |
|---|---|---|---|
| Q4_K_M | ~1-3% | ~0.6 GB | Best balance (default, recommended) |
| Q5_K_M | Minimal | ~0.75 GB | When quality matters more |
| Q8_0 | Near-zero | ~1.0 GB | When accuracy is critical |
| FP16 | None | ~2.0 GB | Research only |
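As a rough rule of thumb, weight memory is the parameter count in billions times the per-billion figure above, plus overhead for the KV cache and runtime. A back-of-the-envelope helper (the 2 GB overhead constant is an assumption, not an official Ollama number):
# rough memory estimate: weights + assumed fixed overhead for KV cache/runtime
GB_PER_BILLION_PARAMS = {
    "q4_k_m": 0.6,
    "q5_k_m": 0.75,
    "q8_0": 1.0,
    "fp16": 2.0,
}

def estimate_memory_gb(params_billions, quant="q4_k_m", overhead_gb=2.0):
    return params_billions * GB_PER_BILLION_PARAMS[quant] + overhead_gb

# an 8B model at Q4_K_M: ~4.8 GB of weights + ~2 GB assumed overhead
print(f"{estimate_memory_gb(8):.1f} GB")  # -> 6.8 GB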
Other Optimization Tips
- Monitor usage: Run ollama ps to check VRAM consumption
- Reduce KV cache memory: OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
- Keep models loaded: OLLAMA_KEEP_ALIVE=-1 prevents unloading between requests (a per-request version is sketched below)
- Concurrent requests: OLLAMA_NUM_PARALLEL=4 for handling multiple sessions
- Smaller models can be faster: A 14B Q4 model with GPU beats a 70B model on CPU
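Keep-alive and other options can also be set per request instead of server-wide. A sketch with the native Python library (num_ctx here is illustrative; larger contexts cost more KV cache memory):
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize quantization in one sentence"}],
    keep_alive=-1,               # keep the model loaded after this request
    options={"num_ctx": 8192},   # request a larger context window
)
print(response["message"]["content"])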
Local Ollama vs Cloud APIs
| Factor | Ollama (Local) | Cloud APIs |
|---|---|---|
| Cost | $0 per token | Pay per token or monthly |
| Privacy | 100% local, no data leaves your machine | Data sent to servers |
| Latency | 10-50ms first token | 200-800ms |
| Model Quality | Good for 80% of tasks | Frontier models still superior |
| Offline | Works without internet | Internet required |
| Setup | Requires hardware knowledge | Simple API key |
| Scalability | Limited by hardware | Effectively unlimited |
Pro tip: Many developers use a hybrid approach: Ollama for high-volume and privacy-sensitive tasks, cloud APIs for complex reasoning. Routing routine calls to a local model can cut cloud API costs substantially.
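In code, a hybrid setup can be as simple as two OpenAI clients behind one helper. A minimal sketch (the routing rule and the cloud model name are placeholders, not recommendations):
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt, complex_reasoning=False):
    # hypothetical rule: stay local by default, escalate hard tasks to the cloud
    client, model = (cloud, "gpt-4o") if complex_reasoning else (local, "llama3.1")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("Summarize this changelog"))                          # local, free
print(ask("Design a sharding scheme", complex_reasoning=True))  # cloud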
Which Model Should You Start With?
- 8 GB RAM, no GPU: Gemma 3 4B or Phi-3 Mini (3.8B)
- 16 GB RAM or 8 GB VRAM: Llama 3.1 8B or Qwen 3 8B
- 32 GB RAM or 16 GB VRAM: DeepSeek-R1 14B or Gemma 3 27B
- 64+ GB RAM or 24+ GB VRAM: Llama 3.1 70B or Qwen 3 32B
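To automate the pick, a small sketch that maps total system RAM to the suggestions above (thresholds mirror the list; assumes psutil is installed, and on Apple Silicon RAM doubles as the GPU budget):
import psutil

def suggest_model():
    ram_gb = psutil.virtual_memory().total / 1e9
    if ram_gb >= 64:
        return "llama3.1:70b"
    if ram_gb >= 32:
        return "gemma3:27b"
    if ram_gb >= 16:
        return "llama3.1:8b"
    return "phi3:mini"

print(f"Try: ollama run {suggest_model()}")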
Related Reads
- CrewAI vs AutoGPT vs LangGraph: Which Free Agent Framework Should You Use in 2026?
- n8n: Open-Source Workflow Automation with AI Agents and 400+ Integrations
- MCP (Model Context Protocol): Connect AI Agents to Any Tool or API
- Google NotebookLM: Free AI Research Tool for Summarizing Documents and PDFs
- Dify: Free Open-Source AI App Builder for Chatbots and Workflows
Final Thoughts
Ollama has made running AI models locally as simple as installing any other app. With one command, you can download and chat with models that rival cloud APIs — completely free and with full data privacy.
Whether you’re a developer building AI-powered apps, a privacy-conscious user who wants to keep data on-device, or someone who just wants to experiment with different models without paying per token, Ollama is the easiest way to get started with local AI.
Get started: ollama.com | Model Library | GitHub
Originally published at toolfreebie.com.