DEV Community

noxlie


Setting Up Ollama Locally: A Developer's Privacy-First AI Stack

I moved all my AI workflows off cloud APIs last month. Ollama made it stupidly easy.

Here's my exact setup — from install to production-ready local inference in under an hour.

Why Local?

Three reasons:

  1. Cost — I was spending $40-60/month on API calls. Now it's $0.
  2. Privacy — My code, my prompts, my data. Never leaves my machine.
  3. Speed — For small models, local is actually faster than round-tripping to an API.

The tradeoff? You need a decent GPU. I'm running an RTX 3060 12GB, which handles 7B-13B models comfortably.
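If you want to sanity-check whether a model fits your card before pulling it, here is a rough back-of-the-envelope sketch. It assumes 4-bit quantized weights and a guessed 1.5GB allowance for the KV cache and runtime buffers; both numbers are my assumptions, not measured values, and real quantized files use slightly more than 4 bits per weight.

```python
def estimate_weight_gb(params_billion, bits_per_weight=4.0):
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_in_vram(params_billion, vram_gb, bits_per_weight=4.0, overhead_gb=1.5):
    """Rough check: weights plus an assumed allowance for cache and buffers."""
    needed = estimate_weight_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


for size in (7, 8, 13):
    print(f"{size}B at 4-bit: ~{estimate_weight_gb(size):.1f} GB of weights")
```

By this estimate a 13B model's weights come to about 6.5GB, which is why it is workable on a 12GB card. Treat it as a first filter only; longer contexts push real usage higher.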

Installation

One command on Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Then pull a model:

ollama pull codellama:13b
ollama pull llama3:8b
ollama pull mistral:7b
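To confirm the pulls worked from a script, you can ask the local server which models it has. This is a minimal sketch using only the standard library and the /api/tags endpoint, which lists installed models. The helper names and the WANTED list are mine, and it assumes the server is on the default port.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"
WANTED = ["codellama:13b", "llama3:8b", "mistral:7b"]


def installed_names(tags_body):
    """Extract model names from a parsed /api/tags response."""
    return {model["name"] for model in tags_body.get("models", [])}


def missing_models(tags_body, wanted=WANTED):
    """Return the wanted models that are not installed, in the order given."""
    have = installed_names(tags_body)
    return [name for name in wanted if name not in have]


def fetch_tags(url=TAGS_URL, timeout=5):
    """Fetch and parse the model list from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `missing_models(fetch_tags())` should return an empty list once all three pulls have finished.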

These three cover 90% of my use cases — code generation, general chat, and fast iteration.

Python Integration

Ollama runs a local API server on localhost:11434. It also exposes OpenAI-compatible endpoints under /v1, so existing OpenAI-client code barely needs to change. The native API is just as simple:

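import requests

def ask_ollama(prompt, model="llama3:8b", timeout=120):
    """Send one prompt to Ollama's native generate endpoint and return the full reply."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,  # wait for the complete answer instead of streaming chunks
        },
        timeout=timeout,  # the first call can be slow while the model loads
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return resp.json()["response"]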
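If your existing code already speaks the OpenAI chat format, the OpenAI-compatible route looks roughly like this. It is a sketch using only the standard library against the /v1/chat/completions path; the function names are mine, and it assumes the default port and a non-streaming request.

```python
import json
import urllib.request

OPENAI_STYLE_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt, model="llama3:8b"):
    """Build an OpenAI-style chat completion payload for a single user turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def extract_reply(body):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return body["choices"][0]["message"]["content"]


def ask_openai_style(prompt, model="llama3:8b", timeout=120):
    """POST the payload to the local server and return the reply text."""
    data = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OPENAI_STYLE_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

The request and response shapes are the same ones a cloud OpenAI-style client would use, so in many cases swapping the base URL is most of the migration.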

Or use the official Python library:

import ollama

response = ollama.chat(model="llama3:8b", messages=[
    {"role": "user", "content": "Explain decorators in Python"}
])
print(response["message"]["content"])
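The requests example sets "stream": False, which waits for the whole answer. With streaming on, the native endpoint sends one JSON object per line, each carrying a fragment in "response" and a "done" flag on the last one. Here is a sketch of reading that stream with the standard library; the function names are my own.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def iter_stream_text(lines):
    """Yield text fragments from newline-delimited JSON chunks until done."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_ollama(prompt, model="llama3:8b", timeout=120):
    """Stream a reply from a running Ollama server, fragment by fragment."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True})
    req = urllib.request.Request(
        GENERATE_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        yield from iter_stream_text(resp)
```

Printing each fragment from `stream_ollama(...)` with `end=""` and `flush=True` gives the familiar typing effect in a terminal.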

My Daily Driver Setup

I switch between three models depending on the task:

| Task | Model | Why |
|---|---|---|
| Code review | codellama:13b | Best at understanding code context |
| Quick questions | mistral:7b | Fast, good enough for simple stuff |
| Writing/analysis | llama3:8b | Best reasoning at this size |
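In scripts, that table boils down to a lookup. A minimal sketch follows; the task keys and the choice of mistral:7b as the fallback are my own illustrative picks, not anything Ollama defines.

```python
# Task-to-model mapping taken from the table above; keys are illustrative.
MODEL_FOR_TASK = {
    "code_review": "codellama:13b",
    "quick_question": "mistral:7b",
    "writing": "llama3:8b",
}

# Assumed fallback: the fastest of the three.
DEFAULT_MODEL = "mistral:7b"


def pick_model(task):
    """Return the model name for a task, falling back to the default."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)


print(pick_model("code_review"))
```

The result can be passed straight in as the model argument of whichever client function you use.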

Performance Benchmarks

On my RTX 3060:

  • mistral:7b → ~45 tokens/sec
  • llama3:8b → ~35 tokens/sec
  • codellama:13b → ~20 tokens/sec

For reference, GPT-4 API typically returns 30-50 tokens/sec. So local 7B models are competitive on speed.
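To reproduce numbers like these on your own hardware, you can compute tokens/sec from the timing fields that a non-streaming /api/generate response carries: eval_count (generated tokens) and eval_duration (nanoseconds). This is a small sketch; the function name is mine, and it returns None when the fields are absent.

```python
def tokens_per_second(generate_body):
    """Compute generation speed from a parsed /api/generate response.

    Returns None if the timing fields are missing or zero.
    """
    count = generate_body.get("eval_count")
    duration_ns = generate_body.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)
```

This measures generation only, not model load or prompt processing, so run a warm-up prompt first and average a few runs before comparing models.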

The Gotchas

  • Context window — Local models have shorter context (4K-8K typical). For long documents, you'll need to chunk.
  • Quality — 7B-13B models aren't as good as GPT-4 for complex reasoning. But for code completion, refactoring suggestions, and simple Q&A? They're great.
  • Memory — 13B models need about 8GB of VRAM (at the 4-bit quantization that the default tags typically use). If you only have 8GB of memory to work with, stick to 7B.
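For the context-window gotcha, the simplest workaround is splitting long input into overlapping pieces and prompting on each one. Here is a character-based sketch. The 8000-character default assumes roughly 4 characters per token for English text, which is a loose rule of thumb and not something the model guarantees.

```python
def chunk_text(text, max_chars=8000, overlap=200):
    """Split text into chunks of at most max_chars, overlapping by overlap chars."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so the next chunk repeats the tail of this one.
        start = end - overlap
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both chunks. Splitting on paragraph or function boundaries usually gives better results than fixed offsets, but this is enough to get started.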

Going Further

For a complete privacy-focused AI stack (not just text — image generation, transcription, the works), check out privacy-ai-guide.vercel.app. It covers local alternatives to every major AI service.

If you're specifically interested in running AI models through crypto-paid APIs (no accounts, no tracking), NanoGPT is worth a look — pay per prompt with crypto, zero registration.

Local AI isn't the future. It's the present. The tools are ready.
