I Ditched OpenAI and Run AI Locally for Free — Here's How
I was spending ~$80/month on AI subscriptions and API calls: ChatGPT Plus, some Anthropic credits, the occasional Gemini Pro request. It adds up fast when you're prototyping things.
Then I discovered you can run surprisingly good models on hardware you probably already own. I've been running a fully local AI setup for about a month now, and my API bill went to zero.
Here's the exact setup I'm using.
The Hardware (Nothing Fancy)
My main inference machine is a desktop PC with an RTX 3060 (12GB VRAM). You can find these used for ~$150. That's it. No A100, no cloud GPU rental.
For context:
- 8B parameter models (like Qwen 3.5) run at ~40 tokens/sec on this card
- 30B parameter models (like Qwen 3 Coder) run at a comfortable ~12 tokens/sec
- Even on a MacBook M1 with 16GB RAM, 8B models are perfectly usable
If you have any modern GPU with 8GB+ VRAM, or an Apple Silicon Mac, you're good.
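A rough way to check whether a model will fit your card: at 4-bit quantization, weights take about half a byte per parameter, plus some headroom for the KV cache and runtime buffers. Here's a back-of-the-envelope estimator (the 20% overhead factor is my own rule of thumb, not an official figure):

```python
def vram_gb(params_billion, bits_per_weight=4, overhead=1.2):
    # Weights: params * (bits / 8) bytes, then ~20% extra for
    # KV cache and runtime buffers (rough assumption)
    return params_billion * bits_per_weight / 8 * overhead

print(vram_gb(8))   # an 8B model at 4-bit: roughly 4.8 GB
print(vram_gb(30))  # a dense 30B model at 4-bit: roughly 18 GB
```

By this estimate an 8B model sits comfortably in 12GB of VRAM, while a dense 30B model doesn't fully fit, which is why larger models on a card like this partially offload to CPU and run noticeably slower.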
Step 1: Install Ollama
This is the part that surprised me. No Docker, no conda environments, no dependency nightmares:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
On macOS, you can also `brew install ollama`. Windows has an installer. That's literally it.
Step 2: Pull a Model
```bash
# Good general-purpose model
ollama pull qwen3.5:9b

# Great for code
ollama pull qwen3-coder:30b

# Solid reasoning
ollama pull deepseek-r1:8b
```
Models download once (~5-18GB depending on size) and run forever. No recurring costs.
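If you lose track of what you've pulled, `ollama list` shows everything, and the same information is available programmatically: Ollama's native REST API lists installed models at `GET /api/tags`. A minimal sketch, assuming the default port (the `parse_model_names` helper name is mine):

```python
import json
import urllib.request

def parse_model_names(tags_json):
    # Pull just the model names out of a /api/tags response
    return [m["name"] for m in tags_json.get("models", [])]

def installed_models(host="http://localhost:11434"):
    # Requires Ollama to be running locally
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return parse_model_names(json.load(resp))
```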
Step 3: Use It
Interactive:
```bash
ollama run qwen3.5:9b
>>> Explain the difference between async and parallel execution
```
Or via API — and here's the killer feature: it's OpenAI-compatible:
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Explain Docker in 3 sentences"}]
  }'
```
Any tool built for the OpenAI API works by just changing the base URL. I've connected it to VS Code extensions, custom scripts, even a Telegram bot.
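To show what that compatibility buys you, here's a minimal Python client for the same endpoint using only the standard library (the `chat` and `build_payload` helper names are mine; the model name and JSON shape come from the curl example above):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model, prompt):
    # Same JSON body as the curl example
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model, prompt):
    # POST to the local OpenAI-compatible endpoint; Ollama must be running
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The official OpenAI SDKs work the same way: point the client's base URL at `http://localhost:11434/v1` and pass any placeholder API key, since Ollama ignores it.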
What I Actually Use This For
This isn't theoretical. Here's what I run daily:
- Code reviews — I pipe diffs through the 30B coder model. It catches things I miss, especially in languages I'm less familiar with
- A Telegram bot — Runs 24/7 on a Mac Mini, answers questions using Qwen 3.5. Nobody can tell it's local
- Document Q&A — RAG pipeline with local embeddings. Load PDFs, ask questions
- Quick lookups — Instead of context-switching to ChatGPT, I just `ollama run` in the terminal
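The code-review workflow is simple enough to sketch in a few lines: grab the diff and feed it to a one-shot `ollama run` call. This is a minimal version; the prompt wording is just a starting point, so adjust it to taste:

```python
import subprocess

def build_review_prompt(diff):
    # Starting-point prompt; tune the instructions to your codebase
    return "Review this diff for bugs and missed edge cases:\n\n" + diff

def review_diff(model="qwen3-coder:30b"):
    # One-shot, non-interactive `ollama run` on the working-tree diff
    diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout
    result = subprocess.run(
        ["ollama", "run", model, build_review_prompt(diff)],
        capture_output=True, text=True,
    )
    return result.stdout
```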
The Real Cost Comparison
| | Monthly Cost | Privacy | Latency |
|---|---|---|---|
| OpenAI GPT-4o | $20-200+ | Data leaves your machine | ~1-3s |
| Local Ollama | ~$5 electricity | Everything stays local | ~0.5-2s |
The electricity cost is real but negligible. My PC draws about 250W under GPU load, and I'm not running inference 24/7.
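The math is easy to check yourself. Assuming an average rate of $0.15/kWh (your local rate will vary):

```python
def monthly_electricity_usd(watts, hours_per_day, price_per_kwh=0.15):
    # Convert watts to kWh over a 30-day month, then multiply by the rate
    return watts / 1000 * hours_per_day * 30 * price_per_kwh
```

At 250W for roughly 4 hours of inference a day, that works out to about $4.50/month, in line with the table above. Even running flat out 24/7 it would be under $30.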
Things I Wish I Knew Earlier
- GPU matters way more than CPU. A $150 used RTX 3060 is 10-15x faster than even a high-end CPU for inference
- Start with smaller models. 7-9B models are shockingly capable. Don't jump to 70B thinking bigger = better — the speed tradeoff isn't worth it for most tasks
- Different models for different jobs. I use the coder model for code, the reasoning model for analysis, and the general model for chat. Specialization matters
- Make it a network service. Set `OLLAMA_HOST=0.0.0.0` and every device in your house can use it
- It works offline. Plane, cabin, whatever. No internet needed after the initial model download
Is It As Good As GPT-4?
Honestly? For 80% of what I was using GPT-4 for, yes. The 9B models handle everyday coding questions, text generation, and analysis just fine.
For the really hard stuff — complex multi-step reasoning, very long context — the cloud models still have an edge. But I find myself needing that maybe once a week. Not worth $80/month.
Try It
If you have a gaming PC or a recent Mac, you can be up and running in literally 5 minutes. `curl | sh`, `ollama pull`, `ollama run`. That's the whole setup.
The worst that happens is you wasted 10 minutes. The best? You save thousands of dollars a year and keep your data private.
What models are you running locally? I'm always looking for recommendations — drop them in the comments.