DEV Community

Prim Ghost

Run Your Own AI Model Locally: A Practical Ollama Setup Guide (2026)

Running AI models locally has become surprisingly accessible. With Ollama, you can run capable language models on a laptop or desktop — no API keys, no subscriptions, no internet required.

Here's a practical guide to getting set up, choosing the right model, and actually using local AI for something useful.


Why Run AI Locally?

Three main reasons people do this:

Privacy. Your prompts never leave your machine. If you're processing code, client data, personal notes, or anything sensitive, local means you control where it goes.

Cost. After hardware, inference is free. No per-token billing, no monthly subscriptions, no rate limits. Run it as much as you want.

Ownership. The model doesn't change overnight, doesn't go down, doesn't require an internet connection. Works on a plane, in a basement, wherever.

The tradeoff is hardware. You need a GPU with enough VRAM to fit the model, or you fall back to CPU inference (slow but usable for some tasks).


What You Actually Need

Minimum Viable Setup

  • Any modern CPU (Intel 10th gen+, Ryzen 3000+)
  • 8GB RAM (16GB better)
  • No GPU required — CPU inference works, just slower
  • ~5-10GB disk space per model

GPU Setup (Recommended)

  • NVIDIA GPU with 6GB+ VRAM for 7B models
  • 8-12GB VRAM for 13-14B models
  • 16GB VRAM to run 27B-class models comfortably
  • AMD GPUs work too via ROCm, though support is newer and less mature than NVIDIA's

What "VRAM" Actually Means

VRAM is your bottleneck. A model loaded into VRAM runs fast (GPU inference). A model that overflows to RAM runs slow (partial CPU fallback). A model entirely on CPU is slower still but still works.

Rule of thumb: a 7B model at Q4 quantization needs about 4-5GB VRAM. A 14B model needs 8-10GB. A 27B model needs 15-16GB.
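That rule of thumb can be sketched as a quick estimator. This is an approximation: the 4.7 "effective bits per weight" figure is my own fudge factor that folds quantization metadata and KV-cache overhead on top of the nominal 4 bits of Q4, chosen to roughly match the numbers above.

```python
def estimate_vram_gb(params_billion: float, effective_bits: float = 4.7) -> float:
    """Rough VRAM needed for a Q4-quantized model.

    effective_bits folds in quantization metadata and KV-cache overhead
    on top of the nominal 4 bits per weight (an approximation, not an
    Ollama constant).
    """
    return round(params_billion * effective_bits / 8, 1)

print(estimate_vram_gb(7))   # roughly 4-5 GB
print(estimate_vram_gb(14))  # roughly 8-10 GB
print(estimate_vram_gb(27))  # roughly 15-16 GB
```

Treat the output as a lower bound for picking a model, not a guarantee — longer contexts grow the KV cache and push real usage higher.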


Installing Ollama

Ollama is available for Linux, macOS, and Windows. Installation is straightforward.

Linux:

curl -fsSL https://ollama.com/install.sh | sh

macOS: Download from ollama.com — native app with menu bar integration.

Windows: Native installer available at ollama.com.

After install, Ollama runs as a background service and exposes a REST API at http://localhost:11434.
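You can sanity-check that the service is up from Python. Ollama's `/api/tags` endpoint lists installed models; the small parsing helper below is illustrative.

```python
import json
from urllib.request import urlopen

def model_names(tags_json: str) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Uncomment once Ollama is running locally:
# with urlopen("http://localhost:11434/api/tags") as resp:
#     print(model_names(resp.read().decode()))
```

If the request fails with a connection error, the background service isn't running yet.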


Your First Model

After installing Ollama, pull and run your first model:

# Pull a model
ollama pull llama3.2

# Run it in the terminal
ollama run llama3.2

# Or use a specific model
ollama run qwen2.5:14b

The first pull downloads the model weights (~4-16GB depending on size). After that, launching is instant.


Choosing the Right Model for Your Hardware

If you have no GPU (CPU only)

Llama 3.2 3B — Fast enough for quick tasks. Good for summarizing, drafting, answering questions. Limited by its small size on harder tasks.

Phi-3 Mini — Microsoft's 3.8B model. Surprisingly capable for its size. Excellent for CPU inference.

If you have 6-8GB VRAM

Llama 3.1 8B — Meta's flagship small model. Versatile, fast. Great starting point.

Mistral 7B — Fast, efficient, strong for its size. Good instruction following.

Qwen2.5 7B — Strong coding performance in a small package.

If you have 10-12GB VRAM

Llama 3.1 8B Q8 — Higher quality than Q4, fits comfortably in 12GB.

Qwen2.5 14B Q4 — Best quality/speed tradeoff in this range. Good at code and reasoning.

Phi-4 14B — Microsoft's current flagship. Very capable for its size.

If you have 16GB VRAM

Qwen2.5 32B Q3/Q4 — This is where it gets interesting. 32B class performance at 16GB.

DeepSeek R1 14B — Reasoning-focused model. Slower but more careful reasoning. Great for complex tasks.

Devstral 24B — Coding specialist. Excellent for code generation, review, debugging.


The Ollama API (Actually Useful)

Ollama exposes a REST API that you can call from any language:

# Simple curl call
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Explain Docker in one paragraph",
    "stream": false
  }'

# Python
import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2",
    "prompt": "Write a Python function that reads a CSV and returns the top 5 rows",
    "stream": False
})

print(response.json()["response"])

This is what makes Ollama powerful for automation. You can pipe it into scripts, build small apps, automate content generation — all running locally.
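With `"stream": true` (Ollama's default), `/api/generate` returns newline-delimited JSON chunks, each carrying a piece of the reply in its `response` field and `"done": true` on the last one. A small helper to reassemble them — a sketch of that documented format:

```python
import json

def assemble_stream(ndjson_lines) -> str:
    """Concatenate the 'response' fragments from a streaming /api/generate reply."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# In practice you'd feed it requests.post(..., stream=True).iter_lines()
```

Streaming matters for interactive use: you can show tokens as they arrive instead of waiting for the full reply.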


OpenAI-Compatible API Mode

Ollama also runs in OpenAI-compatible mode, which means any tool built for the OpenAI API works with Ollama:

# Same endpoint format as OpenAI
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What is a homelab?"}
    ]
  }'

Tools like Continue (VS Code extension), Open WebUI, Obsidian AI, and OpenClaw all support connecting to a local Ollama instance this way.
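The same OpenAI-style endpoint is callable from Python with nothing but the standard library. The payload shape below mirrors the curl call above; the helper names are my own.

```python
import json
from urllib.request import Request, urlopen

def chat_payload(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": user_message}]}

def ask(model: str, message: str) -> str:
    """Send one chat message to a local Ollama instance and return the reply text."""
    req = Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(chat_payload(model, message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# ask("llama3.2", "What is a homelab?")  # requires a running Ollama instance
```

The official OpenAI Python SDK works the same way if you point its `base_url` at `http://localhost:11434/v1`.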


Open WebUI: The Best UI for Ollama

If you want a ChatGPT-style interface for your local models, Open WebUI is the best option.

Install with Docker:

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000. It connects to your local Ollama automatically.

Features worth using:

  • Model switching mid-conversation
  • Document upload and chat (RAG)
  • Conversation history
  • System prompt customization

Useful Local AI Tasks

What people actually do with local AI:

Code Review and Debugging

Paste a function. Ask what's wrong with it. No code ever leaves your machine.

Document Summarization

Feed a long PDF or article. Get a clean summary. Useful for research, reading, catching up.

Writing First Drafts

Brief → full draft in seconds. Edit down from there. Faster than staring at a blank page.

Private Q&A

Anything you'd normally Google but don't want tracked — medical questions, legal basics, financial concepts.

Scripting and Automation

Describe what you want a script to do. Get working Python or bash as a starting point.

Git Commit Messages

Paste your diff. Ask for a clean commit message. Small thing, constant annoyance solved.
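As a concrete sketch of that workflow — the prompt wording is my own, and it assumes staged changes:

```python
import json
import subprocess
from urllib.request import Request, urlopen

def commit_prompt(diff: str) -> str:
    """Wrap a diff in an instruction asking for a one-line commit message."""
    return ("Write a one-line conventional commit message for this diff. "
            "Reply with the message only.\n\n" + diff)

def suggest_commit_message(model: str = "llama3.2") -> str:
    """Ask a local model to summarize the staged diff as a commit message."""
    diff = subprocess.run(["git", "diff", "--staged"],
                          capture_output=True, text=True).stdout
    body = json.dumps({"model": model, "prompt": commit_prompt(diff), "stream": False})
    req = Request("http://localhost:11434/api/generate", data=body.encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

Drop it in a git alias or pre-commit helper and the annoyance disappears entirely.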


Model Chaining and Pipelines

More advanced: you can chain Ollama calls to build small pipelines.

Example: summarize a web page, then extract action items, then format as a structured report — three separate prompts, each feeding into the next.

Libraries like LangChain, LlamaIndex, and the OpenAI SDK (pointed at Ollama's API) all support this. Local inference makes these workflows free to run as much as you want.
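Stripped down, the chaining pattern is just sequential calls where each output becomes the next prompt's input. A minimal sketch — the step templates are illustrative, and `call` stands for any function that sends a prompt to Ollama and returns text:

```python
def run_pipeline(text: str, steps: list, call) -> str:
    """Feed `text` through a list of prompt templates, each containing {input}."""
    for template in steps:
        text = call(template.format(input=text))
    return text

steps = [
    "Summarize this page:\n{input}",
    "Extract action items from this summary:\n{input}",
    "Format these action items as a short report:\n{input}",
]
# run_pipeline(page_text, steps, call=ollama_generate)  # ollama_generate is yours to supply
```

Keeping each step as a separate, focused prompt usually beats one giant prompt that tries to do everything at once.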


The Thing Everyone Misses

People try local AI, get mediocre results, and blame the model.

Usually the problem is the prompt. Local models are more sensitive to prompt quality than hosted models. They benefit from:

  • Specific, clear instructions
  • Examples of the output format you want
  • System prompts that set context and constraints

The model is doing its job. Your job is giving it the right input.
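Concretely, all three of those levers fit into the `messages` list that Ollama's `/api/chat` endpoint accepts: a system prompt for constraints, plus one worked example of the format you want (few-shot). The message content below is illustrative.

```python
def build_messages(system: str, example_in: str, example_out: str, task: str) -> list:
    """A chat message list with a system prompt and one worked example (few-shot)."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": example_in},
        {"role": "assistant", "content": example_out},
        {"role": "user", "content": task},
    ]

messages = build_messages(
    system="You are a release-notes writer. Output exactly one bullet per change.",
    example_in="Fixed crash when config file is missing",
    example_out="- Fix: handle missing config file without crashing",
    task="Added dark mode toggle to settings",
)
# POST {"model": "llama3.2", "messages": messages} to http://localhost:11434/api/chat
```

One good example in the message history often does more for a small local model than a paragraph of instructions.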


Next Steps

If this is interesting to you:

  1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull a model: ollama pull llama3.2
  3. Run it: ollama run llama3.2
  4. If you have a decent GPU: try qwen2.5:14b
  5. Explore Open WebUI for a proper chat interface

The ecosystem moves fast — check ollama.com/library for new models as they drop.


Running a homelab and want to go deeper? The Homelab Starter Guide covers self-hosting fundamentals including setting up Docker, securing your services, and building a proper local stack.
