Ollama: Run AI Models Locally for Free (Complete Setup Guide)

What is Ollama?

Ollama is a free, open-source tool that lets you run powerful AI models on your own computer. No cloud accounts, no API keys, no per-token fees — just download a model and start chatting. It supports models like Llama 3, DeepSeek, Qwen, Gemma, Mistral, and dozens more.

Whether you want complete data privacy, zero-cost AI inference, or the ability to work offline, Ollama makes local AI deployment as simple as a single command.

Installation

macOS

brew install ollama

Or download from ollama.com/download. Requires macOS 11+.

Windows

winget install Ollama.Ollama

Or download the installer from ollama.com/download. Requires Windows 10+.

Linux

curl -fsSL https://ollama.com/install.sh | sh

Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
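Whichever install method you choose, you can verify the server is running by hitting the root endpoint, which responds with a short status message:

curl http://localhost:11434
# Should print: Ollama is running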

Your First Model in 30 Seconds

After installation, running a model is just one command:

ollama run llama3.1

Ollama automatically downloads the model (about 4.7 GB for Llama 3.1 8B) and starts an interactive chat. That’s it — you’re running a large language model locally.
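A first session looks roughly like this (download progress abbreviated; the model's answer will naturally vary):

$ ollama run llama3.1
pulling manifest
pulling model layers... 100%  4.7 GB
success
>>> Why is the sky blue?
Because shorter blue wavelengths of sunlight scatter more strongly in the
atmosphere (Rayleigh scattering) ...
>>> /bye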

Popular Models

| Model              | Sizes         | Best For                   | Download Size (Q4) |
|--------------------|---------------|----------------------------|--------------------|
| Llama 3.1 (Meta)   | 8B, 70B, 405B | General purpose            | ~4.7 GB (8B)       |
| DeepSeek-R1        | 1.5B – 671B   | Reasoning, math, coding    | ~4.7 GB (7B)       |
| Qwen 3 (Alibaba)   | 0.6B – 235B   | Multilingual, 128K context | ~4.7 GB (7B)       |
| Gemma 3 (Google)   | 1B – 27B      | Balanced performance       | ~2.5 GB (4B)       |
| Mistral            | 7B            | General purpose            | ~4.1 GB            |
| Phi-3 (Microsoft)  | 3B, 14B       | Lightweight, coding        | ~2.2 GB (3B)       |
| Gemma 4 (Google)   | 2B – 31B      | Latest gen, MoE            | ~1.5 GB (2B)       |

Essential Commands

# Download a model
ollama pull llama3.1

# Start chatting
ollama run llama3.1

# One-shot prompt (no interactive mode)
ollama run llama3.1 "Explain Docker in one paragraph"

# List downloaded models
ollama list

# Show running models and memory usage
ollama ps

# Delete a model
ollama rm llama3.1
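Inside an interactive session, slash commands control the REPL itself; a few useful ones (type /? at the prompt for the full list):

/show info                        # model details: parameters, quantization, context length
/set parameter temperature 0.2    # adjust sampling for the current session
/clear                            # reset the conversation context
/bye                              # exit the session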

Hardware Requirements

The model size determines how much RAM or VRAM you need:

| Model Size | Minimum RAM | Recommended RAM | GPU VRAM (Q4) |
|------------|-------------|-----------------|---------------|
| 1B – 3B    | 4 GB        | 8 GB            | 2-4 GB        |
| 7B – 8B    | 8 GB        | 16 GB           | 5-6 GB        |
| 13B – 14B  | 16 GB       | 32 GB           | 8-10 GB       |
| 27B – 32B  | 32 GB       | 48 GB           | 16-20 GB      |
| 70B        | 64 GB       | 96 GB           | 38-45 GB      |

Key points:

  • Ollama falls back to CPU if no GPU is detected, typically 5-10x slower (the ollama ps check after this list shows which is in use)
  • Apple Silicon Macs use unified memory — a 16 GB M1/M2 runs 7B-8B models well
  • NVIDIA GPUs with 6+ GB VRAM get significant speedups
  • AMD/Intel GPUs supported via Vulkan backend
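To see where a loaded model actually landed, run ollama ps; the PROCESSOR column reports the CPU/GPU split. Illustrative output (the ID and numbers will differ on your machine):

$ ollama ps
NAME              ID            SIZE      PROCESSOR    UNTIL
llama3.1:latest   <model id>    6.7 GB    100% GPU     4 minutes from now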

Using the API (Python)

Ollama provides an OpenAI-compatible API at http://localhost:11434/v1. Most code written for the OpenAI SDK works with Ollama just by changing the base URL.
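You can verify the endpoint with nothing but curl before wiring up an SDK:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Say hello in five words"}]
  }'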

With the OpenAI Python SDK

pip install openai
Then point the SDK at the local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK but not used
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a short poem about coding"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

With the Native Ollama Library

pip install ollama
Usage looks much the same:
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain Docker in one paragraph"}]
)

print(response["message"]["content"])
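The native client streams too; pass stream=True to get a generator of partial responses:

import ollama

stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about servers"}],
    stream=True,
)

# Each chunk carries the next fragment of the reply
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)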

Connect Ollama to OpenClaw (Free Local AI Agent)

Combine Ollama with OpenClaw to build a fully local, zero-cost AI agent that can browse the web, manage files, and chat with you on WhatsApp or Telegram — all running on your own hardware with complete data privacy.

Setup

# 1. Make sure Ollama is running with a model
ollama pull llama3.1

# 2. Install and configure OpenClaw
npm install -g openclaw@latest
openclaw onboard
# Select "Ollama" from the provider list

OpenClaw auto-discovers your local models and connects to them. You get a fully functional AI agent — shell commands, file operations, browser control, scheduled tasks — all powered by your local model.

Performance Tips

Quantization: The Key to Running Big Models

Quantization compresses models to use less memory with minimal quality loss:

| Level  | Quality Loss | VRAM per 1B Params | When to Use                         |
|--------|--------------|--------------------|-------------------------------------|
| Q4_K_M | ~1-3%        | ~0.6 GB            | Best balance (default, recommended) |
| Q5_K_M | Minimal      | ~0.75 GB           | When quality matters more           |
| Q8_0   | Near-zero    | ~1.0 GB            | When accuracy is critical           |
| FP16   | None         | ~2.0 GB            | Research only                       |
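These per-parameter figures make sizing easy to estimate. A back-of-the-envelope sketch in Python (the GB-per-billion-parameters values come from the table above; the 20% overhead for KV cache and runtime buffers is a rough assumption, and real usage grows with context length):

# Approximate load size: params (in billions) x GB per 1B params x overhead
GB_PER_1B = {"q4_k_m": 0.6, "q5_k_m": 0.75, "q8_0": 1.0, "fp16": 2.0}

def estimate_memory_gb(params_b: float, quant: str = "q4_k_m",
                       overhead: float = 1.2) -> float:
    """Rough RAM/VRAM needed to run a model at a given quantization."""
    return params_b * GB_PER_1B[quant] * overhead

print(f"{estimate_memory_gb(8):.1f} GB")           # ~5.8 GB: 8B at Q4 fits in 6 GB VRAM
print(f"{estimate_memory_gb(70, 'q8_0'):.1f} GB")  # ~84.0 GB: 70B at Q8 is server territory

Most models in the Ollama library also publish alternate quantization tags, so you can pull a heavier or lighter variant than the default (check each model's page for the exact tag names).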

Other Optimization Tips

  • Monitor usage: Run ollama ps to check VRAM consumption
  • Reduce KV cache memory: OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
  • Keep models loaded: OLLAMA_KEEP_ALIVE=-1 prevents unloading between requests
  • Concurrent requests: OLLAMA_NUM_PARALLEL=4 for handling multiple sessions (see the sketch after this list for persisting these variables)
  • Smaller models can be faster: A 14B Q4 model with GPU beats a 70B model on CPU
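On Linux, where the installer sets Ollama up as a systemd service, the standard way to persist these variables is a service override (a sketch; pick the variables you actually need):

# Open an override file for the service
sudo systemctl edit ollama.service

# Add your variables in the editor that opens:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama

On macOS the equivalent is launchctl setenv, followed by a restart of the Ollama app.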

Local Ollama vs Cloud APIs

| Factor        | Ollama (Local)                          | Cloud APIs                       |
|---------------|------------------------------------------|----------------------------------|
| Cost          | $0 per token                             | Pay per token or monthly         |
| Privacy       | 100% local, no data leaves your machine  | Data sent to third-party servers |
| Latency       | 10-50 ms to first token                  | 200-800 ms to first token        |
| Model Quality | Good for ~80% of tasks                   | Frontier models still superior   |
| Offline       | Works without internet                   | Internet required                |
| Setup         | Requires hardware knowledge              | Simple API key                   |
| Scalability   | Limited by your hardware                 | Effectively unlimited            |

Pro tip: Most developers in 2026 use a hybrid approach — Ollama for high-volume and sensitive tasks, cloud APIs for complex reasoning. This typically reduces cloud costs by 60-80%.
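Because both sides speak the same protocol, a hybrid setup can be two OpenAI clients plus one routing rule. A minimal sketch (the routing flags and the cloud model name are illustrative choices, not anything Ollama prescribes):

from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, sensitive: bool = False, hard: bool = False) -> str:
    """Keep sensitive or high-volume prompts local; send hard reasoning to the cloud."""
    use_cloud = hard and not sensitive
    client, model = (cloud, "gpt-4o") if use_cloud else (local, "llama3.1")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Stays on-device regardless of difficulty:
print(ask("Summarize this internal memo: ...", sensitive=True, hard=True))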

Which Model Should You Start With?

  • 8 GB RAM, no GPU: Gemma 3 4B or Phi-3 Mini 3B
  • 16 GB RAM or 8 GB VRAM: Llama 3.1 8B or Qwen 3 8B
  • 32 GB RAM or 16 GB VRAM: DeepSeek-R1 14B or Gemma 3 27B
  • 64+ GB RAM or 24+ GB VRAM: Llama 3.1 70B or Qwen 3 32B

Final Thoughts

Ollama has made running AI models locally as simple as installing any other app. With one command, you can download and chat with models that rival cloud APIs — completely free and with full data privacy.

Whether you’re a developer building AI-powered apps, a privacy-conscious user who wants to keep data on-device, or someone who just wants to experiment with different models without paying per token, Ollama is the easiest way to get started with local AI.

Get started: ollama.com | Model Library | GitHub


Originally published at toolfreebie.com.
