DEV Community

Sam Hartley

I Ditched OpenAI and Run AI Locally for Free — Here's How

I was spending ~$80/month on AI: a ChatGPT Plus subscription, some Anthropic API credits, the occasional Gemini Pro request. It adds up fast when you're prototyping things.

Then I discovered you can run surprisingly good models on hardware you probably already own. I've been running a fully local AI setup for about a month now, and my API bill went to zero.

Here's the exact setup I'm using.

The Hardware (Nothing Fancy)

My main inference machine is a desktop PC with an RTX 3060 (12GB VRAM). You can find these used for ~$150. That's it. No A100, no cloud GPU rental.

For context:

  • 8B parameter models (like Qwen 3.5) run at ~40 tokens/sec on this card
  • 30B parameter models (like Qwen 3 Coder) run at a comfortable ~12 tokens/sec
  • Even on a MacBook M1 with 16GB RAM, 8B models are perfectly usable

If you have any modern GPU with 8GB+ VRAM, or an Apple Silicon Mac, you're good.

Step 1: Install Ollama

This is the part that surprised me. No Docker, no conda environments, no dependency nightmares:

curl -fsSL https://ollama.com/install.sh | sh

On macOS, you can also brew install ollama. Windows has an installer. That's literally it.

Step 2: Pull a Model

# Good general-purpose model
ollama pull qwen3.5:9b

# Great for code
ollama pull qwen3-coder:30b

# Solid reasoning
ollama pull deepseek-r1:8b

Models download once (~5-18GB depending on size) and run forever. No recurring costs.

Step 3: Use It

Interactive:

ollama run qwen3.5:9b
>>> Explain the difference between async and parallel execution

Or via API — and here's the killer feature: it's OpenAI-compatible:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Explain Docker in 3 sentences"}]
  }'

Any tool built for the OpenAI API works just by changing the base URL. I've connected it to VS Code extensions, custom scripts, even a Telegram bot.
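
As a minimal sketch of that base-URL swap, here's a stdlib-only Python client. The `build_chat_request` and `ask` helpers, the model name, and the prompt are my own examples, not anything Ollama ships:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint -- only this URL differs from OpenAI.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    # Same JSON shape the OpenAI chat API expects.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(model: str, prompt: str) -> str:
    # Requires a running `ollama serve`; defined here but not called.
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_chat_request("qwen3.5:9b", "Explain Docker in 3 sentences")
```

Swap in the official `openai` client the same way: point its `base_url` at `http://localhost:11434/v1` and pass any non-empty string as the API key.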

What I Actually Use This For

This isn't theoretical. Here's what I run daily:

  • Code reviews — I pipe diffs through the 30B coder model. It catches things I miss, especially in languages I'm less familiar with
  • A Telegram bot — Runs 24/7 on a Mac Mini, answers questions using Qwen 3.5. Nobody can tell it's local
  • Document Q&A — RAG pipeline with local embeddings. Load PDFs, ask questions
  • Quick lookups — Instead of context-switching to ChatGPT, I just ollama run in the terminal
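
The code-review pipe can be sketched in a few lines of Python. Everything here is an assumption on my part (the prompt wording, the 20,000-character cap, reviewing staged changes); the actual request is the same chat-completions call shown above:

```python
import subprocess

REVIEW_INSTRUCTIONS = (
    "Review this diff. Flag likely bugs, missing error handling, "
    "and anything that looks unintentional.\n\n"
)

def build_review_prompt(diff_text: str) -> str:
    # Cap the diff so the prompt stays inside the model's context window.
    return REVIEW_INSTRUCTIONS + diff_text[:20_000]

def staged_diff() -> str:
    # Needs a git repo with staged changes; defined here but not called.
    return subprocess.run(
        ["git", "diff", "--staged"],
        capture_output=True, text=True, check=True,
    ).stdout

prompt = build_review_prompt("diff --git a/app.py b/app.py\n+print('hi')")
# POST `prompt` to http://localhost:11434/v1/chat/completions with
# "model": "qwen3-coder:30b", exactly like the curl example above.
```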

The Real Cost Comparison

| | Monthly Cost | Privacy | Latency |
| --- | --- | --- | --- |
| OpenAI GPT-4o | $20-200+ | Data leaves your machine | ~1-3s |
| Local Ollama | ~$5 electricity | Everything stays local | ~0.5-2s |

The electricity cost is real but negligible. My PC draws about 250W under GPU load, and I'm not running inference 24/7.
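
As a back-of-envelope check (the $0.15/kWh rate and 2 hours/day of load are my assumptions, not measured numbers):

```python
watts = 250          # GPU-loaded draw from the paragraph above
hours_per_day = 2    # assumed time spent actually running inference
rate_per_kwh = 0.15  # assumed electricity price in $/kWh

monthly_kwh = watts / 1000 * hours_per_day * 30   # 0.25 kW * 2 h * 30 days = 15 kWh
monthly_cost = monthly_kwh * rate_per_kwh         # 15 kWh * $0.15 = $2.25
```

Even with more pessimistic assumptions, it stays in the low single digits per month.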

Things I Wish I Knew Earlier

  1. GPU matters way more than CPU. A $150 used RTX 3060 is 10-15x faster than even a high-end CPU for inference
  2. Start with smaller models. 7-9B models are shockingly capable. Don't jump to 70B thinking bigger = better — the speed tradeoff isn't worth it for most tasks
  3. Different models for different jobs. I use the coder model for code, the reasoning model for analysis, and the general model for chat. Specialization matters
  4. Make it a network service. Set OLLAMA_HOST=0.0.0.0 and every device in your house can use it
  5. It works offline. Plane, cabin, whatever. No internet needed after the initial model download
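
For tip 4, clients on other devices just swap localhost for the server's LAN address; a tiny helper makes that explicit (the 192.168.1.50 address below is a made-up example):

```python
def ollama_url(host: str, port: int = 11434) -> str:
    # 11434 is Ollama's default port; setting OLLAMA_HOST=0.0.0.0 on the
    # server makes it listen on all interfaces instead of loopback only.
    return f"http://{host}:{port}/v1/chat/completions"

# On the machine running Ollama:
local = ollama_url("localhost")
# From a phone or laptop on the same network:
remote = ollama_url("192.168.1.50")
```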

Is It As Good As GPT-4?

Honestly? For 80% of what I was using GPT-4 for, yes. The 9B models handle everyday coding questions, text generation, and analysis just fine.

For the really hard stuff — complex multi-step reasoning, very long context — the cloud models still have an edge. But I find myself needing that maybe once a week. Not worth $80/month.

Try It

If you have a gaming PC or a recent Mac, you can be up and running in literally 5 minutes. curl | sh, ollama pull, ollama run. That's the whole setup.

The worst that happens is you wasted 10 minutes. The best? You save thousands of dollars a year and keep your data private.


What models are you running locally? I'm always looking for recommendations — drop them in the comments.
