Sam Hartley

Posted on Jun 8

Run Your Own AI Server for $0/month with Ollama

#ai #opensource #llm #tutorial

You don't need OpenAI. You don't need a $200/month API bill. You can run powerful AI models on hardware you already own — for free.

Here's exactly how I set this up, and why I haven't paid for API credits in months.

Why Local AI?

Zero API costs — no per-token billing, no surprise invoices
Full privacy — your data never leaves your network
No rate limits — run as many queries as your hardware allows
Works offline — no internet? No problem
No vendor lock-in — switch models, change configs, own your stack

What You Need

Any modern computer works. Here's what different setups can handle:

Hardware	RAM	Best Models	Speed
MacBook M1/M2/M3/M4	8-16GB	Qwen 3.5 9B, Llama 3.1 8B	Fast ⚡
Gaming PC (RTX 3060+)	16GB+	Qwen 3 Coder 30B, DeepSeek R1	Very Fast 🚀
Old laptop/desktop	8GB+	Phi-3 Mini, Gemma 2B	Usable 🐢
Raspberry Pi 5	8GB	Tiny models only	Slow 🐌

The sweet spot: A used gaming GPU (RTX 3060 12GB) costs ~$150 on eBay. It fits 7-14B models entirely in VRAM and can still run 30B models like Qwen 3 Coder 30B by spilling part of the model into system RAM.
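To sanity-check whether a model will fit on your hardware, a rough rule of thumb is: weight memory ≈ parameters × bits per weight ÷ 8, plus some headroom for context. Here is a minimal sketch of that arithmetic. The function name `estimate_model_gb` and the 1.2 overhead factor are assumptions for illustration, not measured values, and real usage varies with quantization format and context length.

```shell
# Rough estimate of the memory a model needs, in GB.
# Usage: estimate_model_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT [OVERHEAD_FACTOR]
# OVERHEAD_FACTOR (default 1.2) is an assumed allowance for the KV cache and
# runtime buffers; it is a placeholder, not a measured value.
estimate_model_gb() {
  LC_ALL=C awk -v p="$1" -v b="$2" -v o="${3:-1.2}" \
    'BEGIN { printf "%.1f\n", p * b / 8 * o }'
}

estimate_model_gb 9 4     # 9B model at 4-bit: prints 5.4
estimate_model_gb 30 4    # 30B model at 4-bit: prints 18.0
```

By this estimate a 9B model at 4-bit needs roughly 5 GB, while a 30B model at 4-bit lands around 18 GB, which is more than a 12 GB card holds. When a model doesn't fit in VRAM, Ollama keeps the remainder in system RAM, which works but runs slower.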

Step 1: Install Ollama (2 minutes)

# macOS or Linux — one command
curl -fsSL https://ollama.com/install.sh | sh

# Windows — download from ollama.com

That's it. No Docker, no Python environments, no dependency hell.

Step 2: Download a Model (5 minutes)

# Fast & capable (recommended starter)
ollama pull qwen3.5:9b

# Code specialist
ollama pull qwen3-coder:30b

# Reasoning powerhouse
ollama pull deepseek-r1:8b

Models download once and run locally forever.

Step 3: Start Using It

Interactive Chat

ollama run qwen3.5:9b

>>> What's the fastest sorting algorithm for nearly-sorted data?

API Access (OpenAI-compatible!)

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Explain Docker in 3 sentences"}]
  }'

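Yes, it's OpenAI API compatible. Most tools that speak the OpenAI chat completions API work with Ollama once you point their base URL at http://localhost:11434/v1. Some tools also insist on an API key; any placeholder value works because Ollama ignores it.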
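To make the "just change the base URL" idea concrete, here is a small sketch that reads the server address from an environment variable and falls back to the local Ollama endpoint. The function names `chat_payload` and `chat` are made up for this example, and the variable names `OPENAI_BASE_URL` and `OPENAI_API_KEY` are simply this sketch's choice. The payload builder is deliberately naive and assumes the prompt contains no double quotes or backslashes.

```shell
# Build a minimal chat-completions request body.
# Usage: chat_payload MODEL PROMPT
# PROMPT is inserted as-is, so this sketch assumes it contains no double
# quotes or backslashes; a real script would use a JSON tool such as jq.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# Send the request to whichever OpenAI-compatible server OPENAI_BASE_URL
# points to, defaulting to a local Ollama instance.
# Usage: chat MODEL PROMPT
chat() {
  curl -s "${OPENAI_BASE_URL:-http://localhost:11434/v1}/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${OPENAI_API_KEY:-ollama}" \
    -d "$(chat_payload "$1" "$2")"
}
```

With Ollama running, `chat qwen3.5:9b "Explain Docker in 3 sentences"` sends the same request as the curl command above. Pointing `OPENAI_BASE_URL` at a different OpenAI-compatible server switches backends without touching the script.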

Step 4: Make It a Server

Want other devices on your network to access it?

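# Start Ollama with network access
# (if Ollama is already running as a background service, stop it first; otherwise the port is taken)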
OLLAMA_HOST=0.0.0.0 ollama serve

Now any device on your network can query http://YOUR_IP:11434.
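A quick way to confirm the server is reachable from another device is to ask it which models it has. The sketch below uses Ollama's `/api/tags` endpoint for that. The function names `ollama_url` and `list_remote_models` are hypothetical helpers, and `192.168.1.50` in the usage note is a placeholder address, not one from this setup.

```shell
# Build the base URL for an Ollama server on the LAN.
# Usage: ollama_url HOST [PORT]   (11434 is Ollama's default port)
ollama_url() {
  printf 'http://%s:%s' "$1" "${2:-11434}"
}

# List the models available on a remote server via the /api/tags endpoint.
# Usage: list_remote_models HOST [PORT]
list_remote_models() {
  curl -s "$(ollama_url "$1" "$2")/api/tags"
}
```

From a second machine, `list_remote_models 192.168.1.50` should return a JSON list of the models you pulled. If the connection is refused, the server is most likely still bound to localhost or blocked by a firewall.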

What I've Built With This

Telegram Bot running 24/7 on a Mac Mini, answering questions via local Qwen 3.5
Code Review Agent using Qwen 3 Coder 30B — reviews PRs in ~12 seconds
Document Q&A with RAG pipeline — load PDFs, ask questions, get cited answers
Garmin Watch Face that fetches stock data (the background service uses local AI for formatting)

The Cost Comparison

Solution	Monthly Cost	Privacy	Speed
OpenAI GPT-4o	$20-200+	❌ Cloud	Fast
Anthropic Claude	$20-100+	❌ Cloud	Fast
Google Gemini	$0-25+	❌ Cloud	Fast
Ollama (Local)	$0	✅ Private	Fast

The only cost is electricity — roughly $5-15/month if running 24/7 on a desktop PC.
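The electricity figure is simple arithmetic: average watts ÷ 1000 × hours × price per kWh. A minimal sketch follows. The 60 W and 100 W draws and the $0.15/kWh rate are assumed example inputs, not measurements from the setup described here, so plug in your own numbers.

```shell
# Estimate the monthly electricity cost of an always-on machine.
# Usage: monthly_power_cost AVG_WATTS PRICE_PER_KWH [HOURS_PER_DAY]
# Assumes a 30-day month; HOURS_PER_DAY defaults to 24.
monthly_power_cost() {
  LC_ALL=C awk -v w="$1" -v r="$2" -v h="${3:-24}" \
    'BEGIN { printf "%.2f\n", w / 1000 * h * 30 * r }'
}

monthly_power_cost 60 0.15    # 60 W average at $0.15/kWh: prints 6.48
monthly_power_cost 100 0.15   # 100 W average at $0.15/kWh: prints 10.80
```

Both example inputs land inside the $5-15 range quoted above. A machine that spends most of the day under heavy GPU load, or a higher local electricity rate, would push the number up.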

Pro Tips

Use GPU, not CPU — a $150 used RTX 3060 is often 10-15x faster than CPU-only inference
Start with 7-9B models — They're surprisingly capable and fast
Try different models for different tasks — coding, reasoning, and chat each have specialists
Use the OpenAI-compatible API — it's on by default under /v1, so many OpenAI-based tools work after a base URL change
Set up auto-start — systemctl enable ollama on Linux, launchd on macOS
Run multiple models — I keep 3-4 models loaded and switch based on the task
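For the auto-start and multiple-models tips on Linux, one option is a systemd drop-in that sets Ollama's environment variables. The sketch below only prints the file contents. `OLLAMA_MAX_LOADED_MODELS` and `OLLAMA_KEEP_ALIVE` are Ollama settings, but the helper name `ollama_dropin` and the default values (3 models, 30m) are illustrative assumptions, not recommendations or the configuration used in this post.

```shell
# Print a systemd drop-in for the Ollama service.
# Usage: ollama_dropin [MAX_LOADED_MODELS] [KEEP_ALIVE]
# The defaults (3 models, 30m) are illustrative values only.
ollama_dropin() {
  printf '[Service]\n'
  printf 'Environment="OLLAMA_MAX_LOADED_MODELS=%s"\n' "${1:-3}"
  printf 'Environment="OLLAMA_KEEP_ALIVE=%s"\n' "${2:-30m}"
}

# To apply it on a systemd-based Linux install (run manually, needs root):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   ollama_dropin | sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo systemctl daemon-reload
#   sudo systemctl enable ollama
#   sudo systemctl restart ollama
```

How many models can stay loaded at once still depends on available memory, so treat the first argument as an upper bound rather than a guarantee.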

My Current Setup

I run a 3-machine lab:

Machine	Role	Model
Mac Mini M4	Quick chat, orchestration	Qwen 3 4B
Windows PC (RTX 3060)	Heavy inference, coding	Qwen 3 Coder 30B
Ubuntu box	Fallback, background tasks	minicpm-v

Total monthly API cost: $0. Additional hardware cost: one $150 used GPU.
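A multi-machine setup like this needs some way to decide which box handles which request. The post doesn't show how that routing is done, so the following is only one possible sketch. The hostnames and the `route_task` function are hypothetical, and the `qwen3:4b` tag is a guess at the Ollama tag for the Qwen 3 4B model in the table.

```shell
# Map a task type to "HOST MODEL" for a three-machine lab.
# Hostnames are hypothetical placeholders; replace them with your own.
# Usage: route_task chat|code|<anything else>
route_task() {
  case "$1" in
    chat) echo "mac-mini.local qwen3:4b" ;;            # quick chat, orchestration
    code) echo "gaming-pc.local qwen3-coder:30b" ;;    # heavy inference, coding
    *)    echo "ubuntu-box.local minicpm-v" ;;         # fallback, background tasks
  esac
}

# Example of consuming the result (not run here):
#   set -- $(route_task code)
#   curl "http://$1:11434/v1/chat/completions" ...   # with "model": "$2"
```

Keeping the mapping in one function means swapping a model or moving a role to another machine is a one-line change.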

Want to Go Deeper?

I write about running AI locally, home lab setups, and turning hardware into income. If you want more of this, drop a comment — I read every one.

Top comments (1)

Max Quimby • Jun 13

Great practical writeup — I run a couple of always-on local models on a Mac Mini for exactly the "no per-token anxiety" reason, and it genuinely changes how freely you experiment. One nuance I'd add so people set expectations right: "$0/month" is really "cost moved from opex to capex + your time." The number that decides whether local actually wins is concurrency, not single-stream speed. Ollama is lovely for one request at a time, but the moment you have a handful of simultaneous users it serializes and your p95 falls off a cliff — that's where batching runtimes (vLLM, or llama.cpp's continuous batching) start to matter and the comparison table gets more complicated. For a personal bot or a code-review agent that runs one PR at a time, local is the right call and the privacy win is real. Have you pushed yours past single-user — any luck with batched serving, or is it strictly one-at-a-time on the 3060?