Setting Up Ollama Locally: A Developer's Privacy-First AI Stack

#python #ai #selfhosted #tutorial

I moved all my AI workflows off cloud APIs last month. Ollama made it stupidly easy.

Here's my exact setup — from install to production-ready local inference in under an hour.

Why Local?

Three reasons:

Cost — I was spending $40-60/month on API calls. Now it's $0.
Privacy — My code, my prompts, my data. Never leaves my machine.
Speed — For small models, local is actually faster than round-tripping to an API.

The tradeoff? You need a decent GPU. I'm running an RTX 3060 12GB, which handles 7B-13B models comfortably.

Installation

One command on Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Then pull a model:

ollama pull codellama:13b
ollama pull llama3:8b
ollama pull mistral:7b

These three cover 90% of my use cases — code generation, general chat, and fast iteration.

Python Integration

Ollama runs a local API server on localhost:11434. It's OpenAI-compatible, so your existing code barely needs to change:

import requests

def ask_ollama(prompt, model="llama3:8b"):
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False
    })
    return resp.json()["response"]

Or use the official Python library:

import ollama

response = ollama.chat(model="llama3:8b", messages=[
    {"role": "user", "content": "Explain decorators in Python"}
])
print(response["message"]["content"])

My Daily Driver Setup

I have three models running depending on the task:

Task	Model	Why
Code review	codellama:13b	Best at understanding code context
Quick questions	mistral:7b	Fast, good enough for simple stuff
Writing/analysis	llama3:8b	Best reasoning at this size

Performance Benchmarks

On my RTX 3060:

mistral:7b → ~45 tokens/sec
llama3:8b → ~35 tokens/sec
codellama:13b → ~20 tokens/sec

For reference, GPT-4 API typically returns 30-50 tokens/sec. So local 7B models are competitive on speed.

The Gotchas

Context window — Local models have shorter context (4K-8K typical). For long documents, you'll need to chunk.
Quality — 7B-13B models aren't as good as GPT-4 for complex reasoning. But for code completion, refactoring suggestions, and simple Q&A? They're great.
RAM — 13B models need about 8GB VRAM. If you only have 8GB total system RAM, stick to 7B.

Going Further

For a complete privacy-focused AI stack (not just text — image generation, transcription, the works), check out privacy-ai-guide.vercel.app. It covers local alternatives to every major AI service.

If you're specifically interested in running AI models through crypto-paid APIs (no accounts, no tracking), NanoGPT is worth a look — pay per prompt with crypto, zero registration.

Local AI isn't the future. It's the present. The tools are ready.