How I Built a Completely Free Local AI Stack — Inspired by a 60-Second YouTube Short
By Pranaychandra Ravi
It started with a YouTube Short. Someone on my feed casually demonstrated connecting a local AI model to Claude Code and I stopped mid-scroll. No API key. No subscription. No code leaving their machine. I had to know how it worked.
What followed was a deep dive into local AI — Ollama, Gemma4, Docker, Open WebUI, vector databases, context windows, and a Python script that made my local model generate an ASCII diagram of the Earth and Moon. This post documents everything I learned, every question I asked, and every mistake I made along the way. If you're curious about running AI entirely on your own hardware, this one is for you.
First Question: Wait, Is This Actually Free?
My first instinct was skepticism. Claude Code is Anthropic's product. Surely using it requires a Claude subscription?
The short answer is no — not when you pair it with Ollama and a local model.
Here's what I learned: Claude Code is the agent — the tool that reads your files, runs commands, edits code, and manages multi-step tasks in your terminal. By default it calls Anthropic's API, which costs money. But Claude Code exposes environment variables that let you redirect those API calls anywhere you want — including a local Ollama server running on your own machine.
Ollama added official support for Anthropic's Messages API format, meaning Claude Code can talk to it natively. No hacks, no middleware, no subscription. The only cost is your own electricity and hardware.
Claude Code → talks to → Ollama (local server) → runs → Your model
(no Anthropic servers involved)
So What Exactly Is Ollama?
Before I could set anything up, I needed to understand what Ollama actually is, because "install Ollama" doesn't tell you much.
Think of Ollama as two things in one:
1. A model manager — it downloads, stores, and organizes AI models on your machine. Like a package manager but for AI brains.
2. A local API server — once running, it exposes an endpoint at http://localhost:11434 that any application can call. Your code, Claude Code, Open WebUI, VS Code extensions — anything that speaks the Anthropic or OpenAI API format can connect to it.
This is the key insight I kept coming back to: Ollama itself has no intelligence. It's an empty engine. You have to download a model — a large file containing all the AI's weights and knowledge — before anything useful happens.
Without a model: Ollama = empty server, useless
With a model: Ollama = fully local AI, free forever
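A quick way to see both halves in action is the /api/tags endpoint, which lists whatever models the server is currently holding. A minimal sketch, assuming Ollama is running on its default port and the requests library is installed:

import requests

# Ask the local Ollama server which models it has installed
tags = requests.get("http://localhost:11434/api/tags").json()
if not tags.get("models"):
    print("Server is up, but it's an empty engine: no models installed yet")
for model in tags.get("models", []):
    print(model["name"], "~", round(model["size"] / 1e9, 1), "GB")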
Downloading Your First Model — Which One?
This is where hardware matters. I have:
- 32GB RAM
- NVIDIA GPU with ~11GB VRAM
- Core i9 processor
With an NVIDIA card, Ollama automatically uses CUDA — no setup needed. Your GPU handles inference and it's dramatically faster than CPU-only.
The key concept here is VRAM vs RAM:
Model fits in VRAM → GPU handles everything → Very fast ✅
Model too big for VRAM → spills into system RAM → Slower ⚠️
With 11GB VRAM I can fit most 7B–13B parameter models entirely in GPU memory, which means fast, snappy responses.
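Before pulling anything, you can sanity-check the fit with a rough rule of thumb: parameter count times bytes per parameter at your quantization, plus some headroom for the context cache. The 4-bit default and the 20% overhead factor below are my own ballpark assumptions, not an Ollama rule:

# Back-of-envelope check: will a quantized model fit in 11GB of VRAM?
def estimate_fit(params_billion, bits_per_param=4, vram_gb=11, overhead=1.2):
    weights_gb = params_billion * (bits_per_param / 8)  # 1B params at 4-bit is roughly 0.5 GB
    return weights_gb, weights_gb * overhead <= vram_gb

for size in (7, 13, 27):
    weights_gb, fits = estimate_fit(size)
    print(f"{size}B -> ~{weights_gb:.1f} GB of weights: {'fits in VRAM' if fits else 'spills into RAM'}")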
After thinking through my use cases — coding help, image analysis, document review — I landed on Gemma4 (Google's multimodal model, ~12GB). Here's why it beat out alternatives like Qwen3.6 (28GB):
| | Gemma4 | Qwen3.6 |
|---|---|---|
| Size | ~12GB | ~28GB |
| Fits in 11GB VRAM | Nearly (tiny RAM overflow) | Partial (big RAM spill) |
| Image understanding | ✅ Yes (multimodal) | ❌ No |
| Coding quality | Good | Better |
| Speed on my hardware | Fast | Slower |
My use cases included image-to-text extraction and converting images to coloring pages — Qwen3.6 can't do either because it's text-only. Gemma4 won.
ollama pull gemma4
One command. It downloads, verifies, and stores the model. You can see progress in the terminal.
The Architecture in Plain English
Before going further, I want to share the mental model that made everything click for me:
┌─────────────────────────────────────────────────────┐
│ YOUR COMPUTER │
│ │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ Claude Code │───▶│ Ollama │ │
│ │ (terminal) │ │ :11434 (API) │ │
│ └─────────────┘ └──────┬───────┘ │
│ │ │
│ ┌─────────────┐ ┌──────▼───────┐ │
│ │ Open WebUI │───▶│ Gemma4 │ │
│ │ (browser) │ │ (the brain) │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ │
│ │ Python API │───▶ http://localhost:11434 │
│ │ scripts │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────┘
Zero data leaves your machine
Three different interfaces. One local model. Everything private.
Context Windows — What Are They and Why Do They Matter?
One of the most important concepts I clarified was the context window — the model's working memory. It's the maximum amount of text a model can "see" at once in a conversation. Exceed it and it starts forgetting the beginning.
Here's the reality check comparison:
| | Claude Sonnet 4.5 | Gemma4 (local) |
|---|---|---|
| Context window | 200,000 tokens | ~8,000–32,000 tokens |
| Approximate words | ~150,000 words | ~6,000–24,000 words |
| 6 years of tax docs | Handles comfortably | Would overflow |
Your VRAM also limits how large a context window your local model can actually use: the weights and the growing conversation cache both have to fit in GPU memory, so more VRAM means more usable context.
You can raise the limit manually. Inside an interactive ollama run gemma4 session, set it for that session with:
/set parameter num_ctx 32768
For single documents, images, or focused coding tasks — perfectly fine. For analyzing six years of tax filings all at once? That's where Claude's 200k context is a genuine advantage local models can't match yet.
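The same knob is exposed per request through the API's options field, which is handy in scripts. A sketch, assuming a 32k window actually fits alongside the weights in whatever VRAM you have left:

import requests

# Request a larger working memory for a single call via the num_ctx option
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4",
        "prompt": "Summarize the report pasted below...",
        "stream": False,
        "options": {"num_ctx": 32768}
    }
)
print(response.json()["response"])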
Can Local Models Search the Internet?
Short answer: No, not by default.
Local models are frozen at their training date. They have no internet connection during your conversation. This was an important distinction to understand.
Claude (this chat) → Has web search tool → Knows current events ✅
Gemma4 (local) → No internet → Knowledge frozen at training ❌
This raised an interesting follow-up question though. When I used Gemini to analyze my tax filing and it spotted mistakes — was it searching the internet to find them?
No. And this was a real misconception I had.
Gemini found tax errors because tax law, IRS rules, and common filing mistakes were baked into the model during training. It learned from millions of tax documents, accounting textbooks, and IRS publications. During your session it's not googling anything — it's applying trained knowledge to your specific document.
Think of it like a tax accountant. They studied tax law for years. When reviewing your return they're not searching Google — they're applying what they already know to what you show them.
Local models work the same way. The difference is:
- Gemini/Claude: More recent training data, larger knowledge base, up-to-date tax law changes
- Gemma4 local: Good foundational knowledge, may be slightly behind on very recent rule changes, but your documents never leave your machine
For sensitive financial documents, that privacy trade-off is significant.
Connecting Claude Code to Gemma4
This was surprisingly simple. Claude Code reads three environment variables:
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434
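If you would rather not put these in your shell profile, a small wrapper can set them for a single session. A sketch, assuming the claude CLI is already on your PATH:

import os
import subprocess

# Point Claude Code at the local Ollama server for this session only
env = os.environ.copy()
env["ANTHROPIC_AUTH_TOKEN"] = "ollama"
env["ANTHROPIC_API_KEY"] = ""
env["ANTHROPIC_BASE_URL"] = "http://localhost:11434"

subprocess.run(["claude"], env=env)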
Or using Ollama's built-in launcher:
ollama launch claude
When Claude Code started up I saw this at the bottom of the welcome screen:
gemma4 · API Usage Billing · pranayraavi@gmail.com's Organization
That confirms it's using Gemma4 through Ollama. No Anthropic billing. No subscription.
What you get with this setup:
- ✅ File reading and editing across your project
- ✅ Terminal command execution
- ✅ Multi-step agentic coding tasks
- ✅ Git operations
- ✅ MCP connectors and plugins
- ✅ Project context awareness
- ⚠️ Intelligence capped at Gemma4's capability (weaker than Claude Sonnet/Opus)
The Python API Test
Before setting up a GUI I wanted to confirm the raw API worked. Here's the script I wrote:
import requests

def chat(prompt):
    # One-shot, non-streaming request to the local Ollama server
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma4",
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

print(chat("Write a hello world in ascii diagram of moon and earth"))
Output:
( )
/ \
----(---O---) (------) <-- Orbit Path
/ / \ / / \
| | | | | | |
Gemma4, running entirely on my machine, responding to a Python script. No API key. No internet. Completely local. This was the moment it really clicked.
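The same endpoint also streams. With "stream": True, Ollama sends back one JSON object per line as tokens are generated, which makes long answers feel far more responsive. A minimal sketch:

import json
import requests

# Stream tokens from the local model as they are generated
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma4", "prompt": "Explain context windows in two sentences", "stream": True},
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
print()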
Setting Up Open WebUI — The ChatGPT-Like Interface
For a proper GUI I went with Open WebUI — a beautiful, feature-rich interface that runs locally and connects to Ollama.
First attempt using pip failed because I had Python 3.13 and Open WebUI requires Python 3.11 or 3.12:
ERROR: Could not find a version that satisfies the requirement open-webui
So I went the Docker route instead.
Installing Docker Desktop
Docker Desktop is free for personal use. Download from docker.com/products/docker-desktop. During install, WSL 2 backend gets configured automatically on Windows.
Running Open WebUI
docker run -d `
-p 127.0.0.1:3000:8080 `
--name open-webui `
-v open-webui:/app/backend/data `
--add-host=host.docker.internal:host-gateway `
ghcr.io/open-webui/open-webui:main
I initially tried -p 3000:80, which did not work: there was a port clash on my machine, and in any case the container serves Open WebUI on port 8080 internally, not 80. Switching to -p 127.0.0.1:3000:8080 fixed it and keeps the interface reachable only from localhost.
Confirmed it was running:
netstat -ano | findstr :3000
# TCP 0.0.0.0:3000 LISTENING ← Docker up and running
curl http://localhost:3000
# StatusCode: 200 OK ← Server responding
Then opened http://localhost:3000 in Chrome and saw the Open WebUI interface with Gemma4 auto-detected.
First Real Test — Image to Text Extraction
One of the reasons I picked Gemma4 over Qwen3.6 was its multimodal capability — it can actually see images. I put this to the test immediately.
I had a photo of handwritten chess notes and uploaded it directly into the Open WebUI chat. The prompt was simple: "convert this image to text".
Gemma4 thought for 11 seconds and returned:
FORK/DOUBLE ATTACK
When we attack two or more pieces at the same time then it is known
as fork or double attack
Note- Knights are good at making fork.
That's a perfect transcription of handwritten text — extracted entirely locally, no cloud OCR service, no API key, nothing leaving my machine. It even generated a relevant follow-up suggestion: "Are there other kinds of tactical attacks besides forks, like pins or skewers?"
This is the multimodal capability in action:
- ✅ Handwritten text extracted accurately
- ✅ Context understood (chess notes)
- ✅ Intelligent follow-up suggested
- ✅ 100% local — image never left my PC
- ✅ Free
For anyone with scanned documents, handwritten notes, receipts, or any image containing text — this works out of the box with Gemma4 in Open WebUI.
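The same extraction works against the raw API, which is handy for batch jobs like transcribing a whole folder of scanned notes. A sketch, assuming the model accepts images through Ollama's base64 images field and that notes.jpg is a placeholder for your own file:

import base64
import requests

# Send a local image to the multimodal model and ask for a transcription
with open("notes.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4",
        "prompt": "Convert this image to text.",
        "images": [image_b64],
        "stream": False
    }
)
print(response.json()["response"])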
Document Upload and RAG — How It Actually Works
One of the most powerful features of Open WebUI is document upload with RAG (Retrieval Augmented Generation). This is how you can upload your AWS docs, tax returns, or any PDFs and chat with them.
Here's what happens under the hood:
You upload PDF
↓
Open WebUI splits it into chunks
↓
Converts chunks to embeddings (mathematical vectors)
↓
Stores in ChromaDB (local vector database)
↓
You ask a question
↓
ChromaDB finds the most relevant chunks
↓
Sends chunks to Gemma4 as context
↓
Gemma4 answers based on YOUR document
Everything is stored locally at:
C:\Users\lavan\AppData\Roaming\open-webui\data\
📁 vector_db ← document embeddings (ChromaDB)
📁 uploads ← original files
📄 webui.db ← chat history (SQLite)
Your documents never leave your machine. ChromaDB is completely free and open source.
One important limitation: RAG finds relevant chunks, not the entire document. If an answer spans many sections of a large document, it might miss some context. The workaround is to upload smaller, focused documents rather than one giant PDF.
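To make that pipeline concrete, here is a miniature version of the same retrieve-then-generate loop against the raw Ollama API. This is not how Open WebUI implements it internally: the fixed-size chunking, the brute-force cosine search, and the choice of nomic-embed-text (a separate embedding model you would fetch with ollama pull nomic-embed-text) are simplifying assumptions, and my_notes.txt is a placeholder file:

import requests

OLLAMA = "http://localhost:11434"

def embed(text):
    # Turn text into a vector using a local embedding model
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# 1. Index: split the document into naive fixed-size chunks and embed each one
document = open("my_notes.txt", encoding="utf-8").read()
chunks = [document[i:i + 1000] for i in range(0, len(document), 1000)]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieve: embed the question and keep the three most similar chunks
question = "What does the document say about deductions?"
q_vec = embed(question)
top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:3]

# 3. Generate: hand only those chunks to the local model as context
context = "\n---\n".join(chunk for chunk, _ in top)
answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "gemma4",
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    "stream": False,
}).json()["response"]
print(answer)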
The Full Stack — What I Now Have Running
✅ Ollama — model manager and local API server
✅ Gemma4 — the AI model (multimodal, ~12GB)
✅ Claude Code — agentic coding with local model
✅ Open WebUI — browser-based chat interface with document upload
✅ Python API — scripts calling the model directly
Total monthly cost: $0
When to Use What
After going through all of this, here's the practical split I settled on:
| Task | Use |
|---|---|
| Coding with file editing | Claude Code + Gemma4 |
| Image analysis / image to text | Open WebUI + Gemma4 |
| Document Q&A (private) | Open WebUI + RAG + Gemma4 |
| Web research / current events | Claude.ai or Perplexity |
| Complex reasoning / large context | Claude.ai (paid) |
| Tax doc analysis (all years) | Claude.ai or NotebookLM |
| Quick Python scripts calling AI | Direct Ollama API |
Honest Reflections
What surprised me: How straightforward the setup actually was once I understood the mental model. Ollama is the server, the model is the brain, everything else just connects to it.
What I underestimated: The quality gap between local models and Claude Sonnet/Opus is real. For simple tasks Gemma4 is impressive. For complex multi-step reasoning, Claude's frontier models are noticeably stronger.
What I'd tell myself at the start: Local AI is not a replacement for cloud AI — it's a complement. Use local for private, repetitive, or experimental tasks. Use cloud AI for research, complex reasoning, and anything that benefits from a larger context window.
The privacy win is real: For sensitive documents — financial records, personal data, proprietary code — local AI is genuinely better from a privacy standpoint. Your data does not leave your machine. Full stop.
Resources
- Ollama: ollama.com
- Open WebUI: openwebui.com
- Claude Code: claude.ai/code
- Ollama + Claude Code docs: docs.ollama.com/integrations/claude-code
- Docker Desktop (free): docker.com/products/docker-desktop
All of this runs on a Windows machine with 32GB RAM, an NVIDIA GPU with ~11GB VRAM, and a Core i9 processor. If you have similar hardware you can replicate this entire stack in an afternoon.