You don't need OpenAI. You don't need a $200/month API bill. You can run powerful AI models on hardware you already own — for free.
Here's exactly how I set this up, and why I haven't paid for API credits in months.
Why Local AI?
- Zero API costs — no per-token billing, no surprise invoices
- Full privacy — your data never leaves your network
- No rate limits — run as many queries as your hardware allows
- Works offline — no internet? No problem
- No vendor lock-in — switch models, change configs, own your stack
What You Need
Any modern computer works. Here's what different setups can handle:
| Hardware | RAM | Best Models | Speed |
|---|---|---|---|
| MacBook M1/M2/M3/M4 | 8-16GB | Qwen 3.5 9B, Llama 3.1 8B | Fast ⚡ |
| Gaming PC (RTX 3060+) | 16GB+ | Qwen 3 Coder 30B, DeepSeek R1 (8B/14B distills) | Very Fast 🚀 |
| Old laptop/desktop | 8GB+ | Phi-3 Mini, Gemma 2B | Usable 🐢 |
| Raspberry Pi 5 | 8GB | Tiny models only | Slow 🐌 |
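The sweet spot: a used gaming GPU (RTX 3060 12GB) costs ~$150 on eBay and runs 7-14B models entirely on the GPU. Larger models like Qwen 3 Coder 30B still run, with part of the model offloaded to system RAM. Because that model is a mixture-of-experts design that activates only about 3B parameters per token, the split costs far less speed than it would for a dense 30B model.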
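A rough way to sanity-check whether a model fits: weight memory ≈ parameters × bits per weight ÷ 8. The sketch below assumes about 4.5 bits per weight as a typical 4-bit quantization average (an assumption; real model files vary) and ignores the KV cache and runtime overhead, which come on top. The function names are illustrative.

```bash
#!/usr/bin/env bash
# Rough weight-memory estimate for a quantized model.
# Assumption: size_gb = params_in_billions * bits_per_weight / 8.
# Ignores KV cache, context length, and runtime overhead.
estimate_weights_gb() {
  local params_b="$1"        # parameters, in billions (e.g. 8, 30)
  local bits="${2:-4.5}"     # bits per weight; 4.5 is an assumed Q4-ish average
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Succeeds if the weights alone fit in the given VRAM budget (GB).
fits_in_vram() {
  local params_b="$1" vram_gb="$2" bits="${3:-4.5}"
  awk -v p="$params_b" -v b="$bits" -v v="$vram_gb" \
    'BEGIN { exit !(p * b / 8 <= v) }'
}
```

Under these assumptions an 8B model needs about 4.5 GB and a 30B model about 17 GB, which is why a 30B model on a 12 GB card runs partly from system RAM. Leave a few GB of headroom for context and runtime.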
Step 1: Install Ollama (2 minutes)
```bash
# Linux: one command
curl -fsSL https://ollama.com/install.sh | sh

# macOS or Windows: download the app from ollama.com
```
That's it. No Docker, no Python environments, no dependency hell.
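To confirm the install worked, check the CLI and the local server. This sketch assumes the default address `http://localhost:11434` and uses Ollama's `/api/version` endpoint as a cheap liveness check; the helper names `ollama_running` and `check_install` are illustrative, not part of Ollama.

```bash
#!/usr/bin/env bash
# Post-install sanity check. Helper names are illustrative, not part of Ollama.

# Succeeds if an Ollama server answers at the given base URL.
ollama_running() {
  local base_url="${1:-http://localhost:11434}"
  # -f fails on HTTP errors, -s keeps curl quiet, --max-time avoids hanging
  curl -fs --max-time 3 "$base_url/api/version" >/dev/null 2>&1
}

check_install() {
  if ! command -v ollama >/dev/null 2>&1; then
    echo "ollama CLI not found on PATH" >&2
    return 1
  fi
  ollama --version
  if ollama_running "$@"; then
    echo "server is up"
  else
    echo "server not responding; start it with: ollama serve" >&2
    return 1
  fi
}
```

On Linux the installer normally registers Ollama as a systemd service, so `check_install` should report the server as up right away.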
Step 2: Download a Model (5 minutes)
```bash
# Fast & capable (recommended starter)
ollama pull qwen3.5:9b

# Code specialist
ollama pull qwen3-coder:30b

# Reasoning powerhouse
ollama pull deepseek-r1:8b
```
Models download once and run locally forever.
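If you'd rather not pick by hand, the hardware table can be turned into a tiny helper. The thresholds below are my informal reading of that table, not official guidance: the 30B cutoff is set at 24 GB because a 4-bit 30B model is roughly 17 GB of weights (30B × ~4.5 bits ÷ 8). `phi3:mini` stands in for the table's small-model row; `suggest_model` and `total_mem_gb` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: suggest a starter tag from total memory (whole GB).
# Thresholds are an informal reading of the hardware table, not official guidance.
suggest_model() {
  local mem_gb="$1"
  if (( mem_gb >= 24 )); then
    echo "qwen3-coder:30b"   # ~17 GB of 4-bit weights; a GPU helps a lot
  elif (( mem_gb >= 8 )); then
    echo "qwen3.5:9b"
  else
    echo "phi3:mini"         # small model for low-memory machines
  fi
}

# Total system memory in GB, rounded (Linux: /proc/meminfo in kB; macOS: bytes).
total_mem_gb() {
  if [[ -r /proc/meminfo ]]; then
    awk '/^MemTotal:/ { printf "%.0f\n", $2 / 1024 / 1024 }' /proc/meminfo
  else
    sysctl -n hw.memsize | awk '{ printf "%.0f\n", $1 / 1024 / 1024 / 1024 }'
  fi
}
```

Usage: `ollama pull "$(suggest_model "$(total_mem_gb)")"`. On a PC, VRAM matters as much as system RAM, so treat the suggestion as a starting point.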
Step 3: Start Using It
Interactive Chat
```
ollama run qwen3.5:9b
>>> What's the fastest sorting algorithm for nearly-sorted data?
```
API Access (OpenAI-compatible!)
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Explain Docker in 3 sentences"}]
  }'
```
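Yes, it's OpenAI API compatible: most tools built for the OpenAI API work with Ollama once you point their base URL at `http://localhost:11434/v1`. Many clients also insist on an API key; Ollama ignores it, so any placeholder value works.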
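The same request can be wrapped in a small shell function so that switching between Ollama and a hosted provider only means changing a few variables. This is a sketch: `chat`, `chat_payload`, `json_escape`, and the variables `LLM_BASE_URL`, `LLM_MODEL`, and `LLM_API_KEY` are names I made up for illustration. It needs only `curl` and `awk`, and the escaping handles quotes, backslashes, tabs, and newlines in the prompt, nothing more.

```bash
#!/usr/bin/env bash
# Minimal OpenAI-style chat client in shell. Function and variable names are
# illustrative; only the endpoint path and JSON shape come from the API.

# Escape a string for a JSON string literal
# (backslash, double quote, tab, newline; other control chars are not handled).
json_escape() {
  printf '%s' "$1" | awk '
    NR > 1 { out = out "\\n" }   # newlines between records become \n
    {
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "\\")      out = out "\\\\"
        else if (c == "\"") out = out "\\\""
        else if (c == "\t") out = out "\\t"
        else                out = out c
      }
    }
    END { printf "%s", out }'
}

# Build the chat-completions request body for a single user message.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# POST one prompt; prints the raw JSON response.
chat() {
  local base_url="${LLM_BASE_URL:-http://localhost:11434/v1}"
  local model="${LLM_MODEL:-qwen3.5:9b}"
  local api_key="${LLM_API_KEY:-ollama}"   # Ollama ignores the key
  curl -fsS "$base_url/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $api_key" \
    -d "$(chat_payload "$model" "$1")"
}
```

Usage: `chat "Explain Docker in 3 sentences" | jq -r '.choices[0].message.content'` (if you have `jq`). Pointing the same function at OpenAI means setting `LLM_BASE_URL=https://api.openai.com/v1`, a real key, and a model that service offers.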
Step 4: Make It a Server
Want other devices on your network to access it?
```bash
# Start Ollama with network access
OLLAMA_HOST=0.0.0.0 ollama serve
```
Now any device on your network can query `http://YOUR_IP:11434`. The API has no authentication, so only do this on a network you trust.
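`OLLAMA_HOST=0.0.0.0 ollama serve` only lasts as long as that terminal. On Linux, where the installer already runs Ollama as a systemd service, starting a second server this way usually fails with "address already in use"; Ollama's FAQ instead recommends adding `Environment="OLLAMA_HOST=0.0.0.0"` under `[Service]` via `systemctl edit ollama.service` (on macOS it suggests `launchctl setenv OLLAMA_HOST "0.0.0.0"` and restarting the app). The sketch below writes that systemd drop-in; the function name and its directory argument are my own additions so you can preview the file in a scratch directory first.

```bash
#!/usr/bin/env bash
# Write a systemd drop-in so the Ollama service listens on all interfaces.
# The default directory matches what `systemctl edit ollama.service` uses;
# pass a scratch directory first to preview the file without touching /etc.
# Note: this replaces any existing override.conf in that directory.
write_ollama_override() {
  local dir="${1:-/etc/systemd/system/ollama.service.d}"
  local host="${2:-0.0.0.0}"
  mkdir -p "$dir" || return 1
  cat > "$dir/override.conf" <<EOF
[Service]
Environment="OLLAMA_HOST=$host"
EOF
}

# To apply for real, call write_ollama_override as root, then:
#   sudo systemctl daemon-reload
#   sudo systemctl restart ollama
```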
What I've Built With This:
- Telegram Bot running 24/7 on a Mac Mini, answering questions via local Qwen 3.5
- Code Review Agent using Qwen 3 Coder 30B — reviews PRs in ~12 seconds
- Document Q&A with RAG pipeline — load PDFs, ask questions, get cited answers
- Garmin Watch Face that fetches stock data (the background service uses local AI for formatting)
The Cost Comparison
| Solution | Monthly Cost | Privacy | Speed |
|---|---|---|---|
| OpenAI GPT-4o | $20-200+ | ❌ Cloud | Fast |
| Anthropic Claude | $20-100+ | ❌ Cloud | Fast |
| Google Gemini | $0-25+ | ❌ Cloud | Fast |
| Ollama (Local) | $0 | ✅ Private | Hardware-dependent |
The only cost is electricity — roughly $5-15/month if running 24/7 on a desktop PC.
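The electricity figure is easy to check for your own machine: average watts × hours per month ÷ 1000 gives kWh, multiplied by your rate. The $0.15/kWh default below is an assumption; substitute your own tariff and a measured wattage. `monthly_power_cost` is an illustrative name.

```bash
#!/usr/bin/env bash
# Monthly electricity cost for a machine running 24/7.
# cost = watts * hours_per_month / 1000 * price_per_kwh
monthly_power_cost() {
  local watts="$1"
  local price_per_kwh="${2:-0.15}"   # assumed rate in $/kWh
  local hours_per_month=730          # ~24 * 365 / 12
  awk -v w="$watts" -v p="$price_per_kwh" -v h="$hours_per_month" \
    'BEGIN { printf "%.2f\n", w * h / 1000 * p }'
}
```

At the assumed $0.15/kWh, the $5-15 range corresponds to an average draw of roughly 45-140 W; a machine that mostly idles between requests sits at the low end.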
Pro Tips
- Use a GPU, not the CPU: a $150 used RTX 3060 is typically 10-15x faster than CPU-only inference, as long as the model fits in its 12GB of VRAM
- Start with 7-9B models: they're surprisingly capable and fast
- Try different models for different tasks: coding, reasoning, and chat each have specialists
- Use the OpenAI-compatible API: Ollama serves it at `http://localhost:11434/v1` by default, so thousands of existing tools work after a base-URL change
- Set up auto-start: `systemctl enable ollama` on Linux (the installer usually does this already), launchd on macOS
- Run multiple models: I keep 3-4 models loaded and switch based on the task
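Switching models per task can be as simple as a lookup before each request. The task names and task-to-model mapping below are hypothetical, using the tags from Step 2. By default Ollama unloads a model a few minutes after its last use; the `OLLAMA_KEEP_ALIVE` and `OLLAMA_MAX_LOADED_MODELS` environment variables (documented in Ollama's FAQ) control how long models stay loaded and how many can be resident at once.

```bash
#!/usr/bin/env bash
# Hypothetical task -> model routing; adjust the mapping to your own tags.
model_for_task() {
  case "$1" in
    code|review) echo "qwen3-coder:30b" ;;
    reason|math) echo "deepseek-r1:8b" ;;
    *)           echo "qwen3.5:9b" ;;   # general chat fallback
  esac
}

# Send one prompt to the routed model via the CLI (non-interactive run).
ask() {
  local task="$1"; shift
  ollama run "$(model_for_task "$task")" "$*"
}
```

Usage: `ask code "Review this function for off-by-one errors: ..."`. To keep models warm longer, start the server with something like `OLLAMA_KEEP_ALIVE=1h ollama serve`.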
My Current Setup
I run a 3-machine lab:
| Machine | Role | Model |
|---|---|---|
| Mac Mini M4 | Quick chat, orchestration | Qwen 3 4B |
| Windows PC (RTX 3060) | Heavy inference, coding | Qwen 3 Coder 30B |
| Ubuntu box | Fallback, background tasks | minicpm-v |
Total monthly API cost: $0. Total hardware cost: one $150 used GPU.
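With several machines, a client can try them in priority order and use the first one that answers, which is one simple way to get the "fallback" role in the table above. This is a sketch: the hostnames are placeholders, `probe_host` and `pick_host` are illustrative names, and `/api/version` serves only as a cheap liveness check.

```bash
#!/usr/bin/env bash
# Try Ollama hosts in priority order; print the first one that responds.

probe_host() {
  curl -fs --max-time 2 "$1/api/version" >/dev/null 2>&1
}

pick_host() {
  local host
  for host in "$@"; do
    if probe_host "$host"; then
      echo "$host"
      return 0
    fi
  done
  echo "no Ollama host reachable" >&2
  return 1
}

# Example (hypothetical hostnames):
#   base=$(pick_host http://gpu-pc.local:11434 http://mac-mini.local:11434 http://ubuntu-box.local:11434)
#   curl -s "$base/api/tags"   # list the models available on the chosen machine
```

Because each machine serves different models, a real client would also pick the model based on which host answered.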
Want to Go Deeper?
I write about running AI locally, home lab setups, and turning hardware into income. If you want more of this, drop a comment — I read every one.