I moved all my AI workflows off cloud APIs last month. Ollama made it stupidly easy.
Here's my exact setup — from install to production-ready local inference in under an hour.
Why Local?
Three reasons:
- Cost — I was spending $40-60/month on API calls. Now it's $0.
- Privacy — My code, my prompts, my data. Never leaves my machine.
- Speed — For small models, local is actually faster than round-tripping to an API.
The tradeoff? You need a decent GPU. I'm running an RTX 3060 12GB, which handles 7B-13B models comfortably.
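If you want to sanity-check whether a model will fit on your card, a rough back-of-the-envelope is parameter count times bytes per weight. The sketch below assumes about 4 bits per weight (an assumption for illustration; actual quantized files vary) and counts weights only, so the KV cache and runtime overhead come on top. `estimate_weight_gb` is a hypothetical helper name, not part of Ollama.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Weights-only estimate for the model sizes discussed in this post
for size in (7, 8, 13):
    print(f"{size}B -> ~{estimate_weight_gb(size):.1f} GB of weights")
```

At 13B that works out to about 6.5 GB before overhead, which lines up with a 12GB card having room to spare.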
Installation
One command on Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Then pull the models you want:
ollama pull codellama:13b
ollama pull llama3:8b
ollama pull mistral:7b
These three cover 90% of my use cases — code generation, general chat, and fast iteration.
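To confirm the pulls worked, `ollama list` on the command line shows what's installed. If you'd rather check from a script, here's a minimal sketch against the server's `/api/tags` endpoint, assuming the default port. `missing_models` and `fetch_tags` are hypothetical helper names.

```python
import json
import urllib.request

WANTED = ["codellama:13b", "llama3:8b", "mistral:7b"]

def missing_models(tags_payload: dict, wanted=WANTED) -> list:
    """Return the wanted model names that are absent from an /api/tags payload."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    return [name for name in wanted if name not in installed]

def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server (the server must be up)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

With the server running, `missing_models(fetch_tags())` should return an empty list once all three pulls have finished.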
Python Integration
Ollama runs a local API server on localhost:11434. The native API is plain JSON over HTTP, and there's also an OpenAI-compatible endpoint under /v1, so your existing code barely needs to change. Here's the native API:
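import requests

def ask_ollama(prompt, model="llama3:8b"):
    # Non-streaming call to Ollama's native generate endpoint
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # the first call can be slow while the model loads
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["response"]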
Or use the official Python library:
import ollama

response = ollama.chat(model="llama3:8b", messages=[
    {"role": "user", "content": "Explain decorators in Python"}
])
print(response["message"]["content"])
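The OpenAI compatibility mentioned above lives on a separate route, `/v1/chat/completions`, rather than `/api/generate`. Here's a minimal sketch of calling it with only the standard library, assuming the default port. `build_chat_request` and `ask_openai_style` are hypothetical helper names.

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama3:8b", base_url="http://localhost:11434"):
    """Build an OpenAI-style chat completion request for a local Ollama server."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_openai_style(prompt, model="llama3:8b"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model), timeout=120) as resp:
        data = json.load(resp)
    # OpenAI-style responses put the text under choices[0].message.content
    return data["choices"][0]["message"]["content"]
```

If you already use an OpenAI client library, pointing its base URL at `http://localhost:11434/v1` is usually the smaller change; the sketch just shows what goes over the wire.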
My Daily Driver Setup
I switch between three models depending on the task:
| Task | Model | Why |
|---|---|---|
| Code review | codellama:13b | Best at understanding code context |
| Quick questions | mistral:7b | Fast, good enough for simple stuff |
| Writing/analysis | llama3:8b | Best reasoning at this size |
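In code, the table above boils down to a small lookup. This is a sketch only: the task keys are names I made up for illustration, and falling back to mistral:7b for unknown tasks is an assumption, not a rule.

```python
# Task -> model mapping taken from the table; the keys are hypothetical labels
MODEL_FOR_TASK = {
    "code_review": "codellama:13b",
    "quick_question": "mistral:7b",
    "writing": "llama3:8b",
}

def pick_model(task: str, default: str = "mistral:7b") -> str:
    """Return the model for a task, falling back to the fastest one."""
    return MODEL_FOR_TASK.get(task, default)
```

You can then pass `pick_model("code_review")` as the `model` argument wherever you call Ollama.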
Performance Benchmarks
On my RTX 3060:
- mistral:7b → ~45 tokens/sec
- llama3:8b → ~35 tokens/sec
- codellama:13b → ~20 tokens/sec
For reference, the GPT-4 API typically returns roughly 30-50 tokens/sec, so local 7B models are competitive on speed.
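If you want to measure this on your own hardware, a non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating). A minimal sketch of turning those into tokens/sec follows; `tokens_per_second` is a hypothetical helper, and the sample numbers are made up to show the arithmetic, not real measurements.

```python
def tokens_per_second(result: dict) -> float:
    """Generation speed from a non-streaming /api/generate response."""
    # eval_count = tokens generated; eval_duration = generation time in nanoseconds
    return result["eval_count"] / (result["eval_duration"] / 1e9)

# Made-up response fields: 350 tokens generated in 10 seconds
sample = {"eval_count": 350, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # prints "35.0 tokens/sec"
```

Run a few prompts and average the result; the first request after a model loads tends to be an outlier.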
The Gotchas
- Context window: Local models have shorter context (4K-8K tokens is typical). For long documents, you'll need to chunk.
- Quality: 7B-13B models aren't as good as GPT-4 at complex reasoning. But for code completion, refactoring suggestions, and simple Q&A? They're great.
- Memory: 13B models need about 8GB of VRAM. If your GPU has 8GB or less (or you're running on CPU with only 8GB of system RAM), stick to 7B.
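For the context-window gotcha, chunking can be as simple as splitting on words with a bit of overlap so sentences at the boundaries aren't lost. This is a minimal sketch, not a tuned solution: it counts words rather than tokens, and the 1500-word default is a rough assumption meant to stay under a 4K-token window. `chunk_words` is a hypothetical helper name.

```python
def chunk_words(text: str, max_words: int = 1500, overlap: int = 100) -> list:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reached the end of the text
    return chunks
```

Each chunk can then go to the model as its own prompt, with the per-chunk answers summarized in a final pass.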
Going Further
For a complete privacy-focused AI stack (not just text — image generation, transcription, the works), check out privacy-ai-guide.vercel.app. It covers local alternatives to every major AI service.
If you're specifically interested in running AI models through crypto-paid APIs (no accounts, no tracking), NanoGPT is worth a look — pay per prompt with crypto, zero registration.
Local AI isn't the future. It's the present. The tools are ready.