Ollama: Run AI Models Locally for Free (Complete Setup Guide)

What is Ollama?

Ollama is a free, open-source tool that lets you run powerful AI models on your own computer. No cloud accounts, no API keys, no per-token fees — just download a model and start chatting. It supports models like Llama 3, DeepSeek, Qwen, Gemma, Mistral, and dozens more.

Whether you want complete data privacy, zero-cost AI inference, or the ability to work offline, Ollama makes local AI deployment as simple as a single command.

Installation

macOS

brew install ollama

Or download from ollama.com/download. Requires macOS 11+.

Windows

winget install Ollama.Ollama

Or download the installer from ollama.com/download. Requires Windows 10+.

Linux

curl -fsSL https://ollama.com/install.sh | sh

Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
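Whichever install method you choose, you can verify the server is running by hitting the root endpoint, which responds with a short status message:

curl http://localhost:11434
# Should print: Ollama is running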

Your First Model in 30 Seconds

After installation, running a model is just one command:

ollama run llama3.1

Ollama automatically downloads the model (about 4.7 GB for Llama 3.1 8B) and starts an interactive chat. That’s it — you’re running a large language model locally.
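A first session looks roughly like this (download progress abbreviated; the model's answer will naturally vary):

$ ollama run llama3.1
pulling manifest
pulling model layers... 100%  4.7 GB
success
>>> Why is the sky blue?
Because shorter blue wavelengths of sunlight scatter more strongly in the
atmosphere (Rayleigh scattering) ...
>>> /bye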

Popular Models

| Model              | Sizes         | Best For                   | Download Size (Q4) |
|--------------------|---------------|----------------------------|--------------------|
| Llama 3.1 (Meta)   | 8B, 70B, 405B | General purpose            | ~4.7 GB (8B)       |
| DeepSeek-R1        | 1.5B – 671B   | Reasoning, math, coding    | ~4.7 GB (7B)       |
| Qwen 3 (Alibaba)   | 0.6B – 235B   | Multilingual, 128K context | ~4.7 GB (7B)       |
| Gemma 3 (Google)   | 1B – 27B      | Balanced performance       | ~2.5 GB (4B)       |
| Mistral            | 7B            | General purpose            | ~4.1 GB            |
| Phi-3 (Microsoft)  | 3B, 14B       | Lightweight, coding        | ~2.2 GB (3B)       |
| Gemma 4 (Google)   | 2B – 31B      | Latest gen, MoE            | ~1.5 GB (2B)       |

Essential Commands

# Download a model
ollama pull llama3.1

# Start chatting
ollama run llama3.1

# One-shot prompt (no interactive mode)
ollama run llama3.1 "Explain Docker in one paragraph"

# List downloaded models
ollama list

# Show running models and memory usage
ollama ps

# Delete a model
ollama rm llama3.1
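Inside an interactive session, slash commands control the REPL itself; a few useful ones (type /? at the prompt for the full list):

/show info                        # model details: parameters, quantization, context length
/set parameter temperature 0.2    # adjust sampling for the current session
/clear                            # reset the conversation context
/bye                              # exit the session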

Hardware Requirements

The model size determines how much RAM or VRAM you need:

| Model Size | Minimum RAM | Recommended RAM | GPU VRAM (Q4) |
|------------|-------------|-----------------|---------------|
| 1B – 3B    | 4 GB        | 8 GB            | 2-4 GB        |
| 7B – 8B    | 8 GB        | 16 GB           | 5-6 GB        |
| 13B – 14B  | 16 GB       | 32 GB           | 8-10 GB       |
| 27B – 32B  | 32 GB       | 48 GB           | 16-20 GB      |
| 70B        | 64 GB       | 96 GB           | 38-45 GB      |

Key points:

  • Ollama falls back to CPU if no GPU is detected, typically 5-10x slower (the ollama ps check after this list shows which is in use)
  • Apple Silicon Macs use unified memory — a 16 GB M1/M2 runs 7B-8B models well
  • NVIDIA GPUs with 6+ GB VRAM get significant speedups
  • AMD/Intel GPUs supported via Vulkan backend
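To see where a loaded model actually landed, run ollama ps; the PROCESSOR column reports the CPU/GPU split. Illustrative output (the ID and numbers will differ on your machine):

$ ollama ps
NAME              ID            SIZE      PROCESSOR    UNTIL
llama3.1:latest   <model id>    6.7 GB    100% GPU     4 minutes from now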

Using the API (Python)

Ollama provides an OpenAI-compatible API at http://localhost:11434/v1. Most code written for the OpenAI SDK works with Ollama just by changing the base URL.
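You can verify the endpoint with nothing but curl before wiring up an SDK:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Say hello in five words"}]
  }'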

With the OpenAI Python SDK

pip install openai
Then point the SDK at the local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK but not used
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a short poem about coding"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

With the Native Ollama Library

pip install ollama
Usage looks much the same:
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain Docker in one paragraph"}]
)

print(response["message"]["content"])
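The native client streams too; pass stream=True to get a generator of partial responses:

import ollama

stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about servers"}],
    stream=True,
)

# Each chunk carries the next fragment of the reply
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)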

Connect Ollama to OpenClaw (Free Local AI Agent)

Combine Ollama with OpenClaw to build a fully local, zero-cost AI agent that can browse the web, manage files, and chat with you on WhatsApp or Telegram — all running on your own hardware with complete data privacy.

Setup

# 1. Make sure Ollama is running with a model
ollama pull llama3.1

# 2. Install and configure OpenClaw
npm install -g openclaw@latest
openclaw onboard
# Select "Ollama" from the provider list

OpenClaw auto-discovers your local models and connects to them. You get a fully functional AI agent — shell commands, file operations, browser control, scheduled tasks — all powered by your local model.

Performance Tips

Quantization: The Key to Running Big Models

Quantization compresses models to use less memory with minimal quality loss:

| Level  | Quality Loss | VRAM per 1B Params | When to Use                         |
|--------|--------------|--------------------|-------------------------------------|
| Q4_K_M | ~1-3%        | ~0.6 GB            | Best balance (default, recommended) |
| Q5_K_M | Minimal      | ~0.75 GB           | When quality matters more           |
| Q8_0   | Near-zero    | ~1.0 GB            | When accuracy is critical           |
| FP16   | None         | ~2.0 GB            | Research only                       |
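These per-parameter figures make sizing easy to estimate. A back-of-the-envelope sketch in Python (the GB-per-billion-parameters values come from the table above; the 20% overhead for KV cache and runtime buffers is a rough assumption, and real usage grows with context length):

# Approximate load size: params (in billions) x GB per 1B params x overhead
GB_PER_1B = {"q4_k_m": 0.6, "q5_k_m": 0.75, "q8_0": 1.0, "fp16": 2.0}

def estimate_memory_gb(params_b: float, quant: str = "q4_k_m",
                       overhead: float = 1.2) -> float:
    """Rough RAM/VRAM needed to run a model at a given quantization."""
    return params_b * GB_PER_1B[quant] * overhead

print(f"{estimate_memory_gb(8):.1f} GB")           # ~5.8 GB: 8B at Q4 fits in 6 GB VRAM
print(f"{estimate_memory_gb(70, 'q8_0'):.1f} GB")  # ~84.0 GB: 70B at Q8 is server territory

Most models in the Ollama library also publish alternate quantization tags, so you can pull a heavier or lighter variant than the default (check each model's page for the exact tag names).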

Other Optimization Tips

  • Monitor usage: Run ollama ps to check VRAM consumption
  • Reduce KV cache memory: OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
  • Keep models loaded: OLLAMA_KEEP_ALIVE=-1 prevents unloading between requests
  • Concurrent requests: OLLAMA_NUM_PARALLEL=4 for handling multiple sessions (see the sketch after this list for persisting these variables)
  • Smaller models can be faster: A 14B Q4 model with GPU beats a 70B model on CPU
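On Linux, where the installer sets Ollama up as a systemd service, the standard way to persist these variables is a service override (a sketch; pick the variables you actually need):

# Open an override file for the service
sudo systemctl edit ollama.service

# Add your variables in the editor that opens:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama

On macOS the equivalent is launchctl setenv, followed by a restart of the Ollama app.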

Local Ollama vs Cloud APIs

| Factor        | Ollama (Local)                          | Cloud APIs                       |
|---------------|------------------------------------------|----------------------------------|
| Cost          | $0 per token                             | Pay per token or monthly         |
| Privacy       | 100% local, no data leaves your machine  | Data sent to third-party servers |
| Latency       | 10-50 ms to first token                  | 200-800 ms to first token        |
| Model Quality | Good for ~80% of tasks                   | Frontier models still superior   |
| Offline       | Works without internet                   | Internet required                |
| Setup         | Requires hardware knowledge              | Simple API key                   |
| Scalability   | Limited by your hardware                 | Effectively unlimited            |

Pro tip: Most developers in 2026 use a hybrid approach — Ollama for high-volume and sensitive tasks, cloud APIs for complex reasoning. This typically reduces cloud costs by 60-80%.
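Because both sides speak the same protocol, a hybrid setup can be two OpenAI clients plus one routing rule. A minimal sketch (the routing flags and the cloud model name are illustrative choices, not anything Ollama prescribes):

from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, sensitive: bool = False, hard: bool = False) -> str:
    """Keep sensitive or high-volume prompts local; send hard reasoning to the cloud."""
    use_cloud = hard and not sensitive
    client, model = (cloud, "gpt-4o") if use_cloud else (local, "llama3.1")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Stays on-device regardless of difficulty:
print(ask("Summarize this internal memo: ...", sensitive=True, hard=True))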

Which Model Should You Start With?

  • 8 GB RAM, no GPU: Gemma 3 4B or Phi-3 Mini 3B
  • 16 GB RAM or 8 GB VRAM: Llama 3.1 8B or Qwen 3 8B
  • 32 GB RAM or 16 GB VRAM: DeepSeek-R1 14B or Gemma 3 27B
  • 64+ GB RAM or 24+ GB VRAM: Llama 3.1 70B or Qwen 3 32B

Final Thoughts

Ollama has made running AI models locally as simple as installing any other app. With one command, you can download and chat with models that rival cloud APIs — completely free and with full data privacy.

Whether you’re a developer building AI-powered apps, a privacy-conscious user who wants to keep data on-device, or someone who just wants to experiment with different models without paying per token, Ollama is the easiest way to get started with local AI.

Get started: ollama.com | Model Library | GitHub


Originally published at toolfreebie.com.
