<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranay ravi</title>
    <description>The latest articles on DEV Community by Pranay ravi (@pranay_ravi_b88172eac205c).</description>
    <link>https://dev.to/pranay_ravi_b88172eac205c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3372066%2F5ac87370-81a3-4df2-a2b2-6e46f3c4ede8.jpg</url>
      <title>DEV Community: Pranay ravi</title>
      <link>https://dev.to/pranay_ravi_b88172eac205c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranay_ravi_b88172eac205c"/>
    <language>en</language>
    <item>
      <title>How I Built a Completely Free Local AI Stack — Inspired by a 60-Second YouTube Short</title>
      <dc:creator>Pranay ravi</dc:creator>
      <pubDate>Sun, 17 May 2026 02:33:21 +0000</pubDate>
      <link>https://dev.to/pranay_ravi_b88172eac205c/how-i-built-a-completely-free-local-ai-stack-inspired-by-a-60-second-youtube-short-3e39</link>
      <guid>https://dev.to/pranay_ravi_b88172eac205c/how-i-built-a-completely-free-local-ai-stack-inspired-by-a-60-second-youtube-short-3e39</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Completely Free Local AI Stack — Inspired by a 60-Second YouTube Short
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;By Pranaychandra Ravi&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;It started with a YouTube Short. Someone on my feed casually demonstrated connecting a local AI model to Claude Code and I stopped mid-scroll. No API key. No subscription. No code leaving their machine. I had to know how it worked.&lt;/p&gt;

&lt;p&gt;What followed was a deep dive into local AI — Ollama, Gemma4, Docker, Open WebUI, vector databases, context windows, and a Python script that made my local model generate an ASCII diagram of the Earth and Moon. This post documents everything I learned, every question I asked, and every mistake I made along the way. If you're curious about running AI entirely on your own hardware, this one is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  First Question: Wait, Is This Actually Free?
&lt;/h2&gt;

&lt;p&gt;My first instinct was skepticism. Claude Code is Anthropic's product. Surely using it requires a Claude subscription?&lt;/p&gt;

&lt;p&gt;The short answer is &lt;strong&gt;no — not when you pair it with Ollama and a local model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what I learned: Claude Code is the &lt;em&gt;agent&lt;/em&gt; — the tool that reads your files, runs commands, edits code, and manages multi-step tasks in your terminal. By default it calls Anthropic's API, which costs money. But Claude Code exposes environment variables that let you redirect those API calls anywhere you want — including a local Ollama server running on your own machine.&lt;/p&gt;

&lt;p&gt;Ollama added official support for Anthropic's Messages API format, meaning Claude Code can talk to it natively. No hacks, no middleware, no subscription. The only cost is your own electricity and hardware.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code  →  talks to  →  Ollama (local server)  →  runs  →  Your model
                              (no Anthropic servers involved)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  So What Exactly Is Ollama?
&lt;/h2&gt;

&lt;p&gt;Before I could set anything up I needed to understand what Ollama actually is, because "install Ollama" doesn't tell you much.&lt;/p&gt;

&lt;p&gt;Think of Ollama as two things in one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. A model manager&lt;/strong&gt; — it downloads, stores, and organizes AI models on your machine. Like a package manager but for AI brains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. A local API server&lt;/strong&gt; — once running, it exposes an endpoint at &lt;code&gt;http://localhost:11434&lt;/code&gt; that any application can call. Your code, Claude Code, Open WebUI, VS Code extensions — anything that speaks the Anthropic or OpenAI API format can connect to it.&lt;/p&gt;

&lt;p&gt;This is the key insight I kept coming back to: &lt;strong&gt;Ollama itself has no intelligence&lt;/strong&gt;. It's an empty engine. You have to download a model — a large file containing all the AI's weights and knowledge — before anything useful happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without a model:   Ollama = empty server, useless
With a model:      Ollama = fully local AI, free forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
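&lt;p&gt;To sanity-check the "local API server" half, a few lines of Python are enough. This is a sketch using only the standard library; the root endpoint's "Ollama is running" reply and the &lt;code&gt;/api/tags&lt;/code&gt; model list are Ollama's real API, but the sample payload below is made up for illustration:&lt;/p&gt;

```python
import urllib.request

OLLAMA = "http://localhost:11434"  # Ollama's default local endpoint

def server_alive(base=OLLAMA, timeout=2):
    """True if an Ollama server answers on the base URL (it replies 'Ollama is running')."""
    try:
        with urllib.request.urlopen(base, timeout=timeout) as r:
            return b"Ollama is running" in r.read()
    except OSError:
        return False

def model_names(tags_payload):
    """Pull model names out of the JSON that GET /api/tags returns."""
    return [m["name"] for m in tags_payload.get("models", [])]

# Illustrative payload shape; real entries also carry sizes and digests.
sample = {"models": [{"name": "gemma4:latest"}]}
print(model_names(sample))  # ['gemma4:latest']
```

&lt;p&gt;If &lt;code&gt;server_alive()&lt;/code&gt; comes back &lt;code&gt;False&lt;/code&gt;, nothing else in this post will work, so it's worth checking first.&lt;/p&gt;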






&lt;h2&gt;
  
  
  Downloading Your First Model — Which One?
&lt;/h2&gt;

&lt;p&gt;This is where hardware matters. I have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;32GB RAM&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NVIDIA GPU with ~11GB VRAM&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core i9 processor&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With an NVIDIA card, Ollama automatically uses CUDA — no setup needed. Your GPU handles inference and it's dramatically faster than CPU-only.&lt;/p&gt;

&lt;p&gt;The key concept here is &lt;strong&gt;VRAM vs RAM&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model fits in VRAM  →  GPU handles everything  →  Very fast ✅
Model too big for VRAM  →  spills into system RAM  →  Slower ⚠️
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
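&lt;p&gt;That trade-off is simple arithmetic. Here's a sketch; the 1.5GB overhead is my own rough allowance for the runtime and context cache, and the model sizes are approximate, so treat the verdicts as estimates:&lt;/p&gt;

```python
def fits_in_vram(model_gb, vram_gb, overhead_gb=1.5):
    """Rough check: do the model weights, plus a cushion for the runtime and
    context cache, fit on the GPU? overhead_gb is an assumed figure, not exact."""
    return vram_gb >= model_gb + overhead_gb

for name, size_gb in [("7B model, quantized", 4.7), ("gemma4", 12.0), ("qwen3.6", 28.0)]:
    verdict = "GPU only, fast" if fits_in_vram(size_gb, vram_gb=11) else "spills into RAM, slower"
    print(f"{name}: {verdict}")
```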



&lt;p&gt;With 11GB VRAM I can fit most quantized 7B–13B parameter models entirely in GPU memory, which means fast, snappy responses.&lt;/p&gt;

&lt;p&gt;After thinking through my use cases — coding help, image analysis, document review — I landed on &lt;strong&gt;Gemma4&lt;/strong&gt; (Google's multimodal model, ~12GB). Here's why it beat out alternatives like Qwen3.6 (28GB):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Gemma4&lt;/th&gt;
&lt;th&gt;Qwen3.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;~12GB&lt;/td&gt;
&lt;td&gt;~28GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fits in 11GB VRAM&lt;/td&gt;
&lt;td&gt;Nearly (tiny RAM overflow)&lt;/td&gt;
&lt;td&gt;Partial (big RAM spill)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding&lt;/td&gt;
&lt;td&gt;✅ Yes (multimodal)&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding quality&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed on my hardware&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My use cases included image-to-text extraction and converting images to coloring pages — Qwen3.6 can't do either because it's text-only. Gemma4 won.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. It downloads, verifies, and stores the model. You can see progress in the terminal.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture in Plain English
&lt;/h2&gt;

&lt;p&gt;Before going further, I want to share the mental model that made everything click for me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    YOUR COMPUTER                    │
│                                                     │
│  ┌─────────────┐    ┌──────────────┐               │
│  │ Claude Code │───▶│    Ollama    │               │
│  │  (terminal) │    │ :11434 (API) │               │
│  └─────────────┘    └──────┬───────┘               │
│                            │                        │
│  ┌─────────────┐    ┌──────▼───────┐               │
│  │  Open WebUI │───▶│   Gemma4    │               │
│  │  (browser)  │    │  (the brain) │               │
│  └─────────────┘    └─────────────┘               │
│                                                     │
│  ┌─────────────┐                                   │
│  │  Python API │───▶ http://localhost:11434        │
│  │   scripts   │                                   │
│  └─────────────┘                                   │
└─────────────────────────────────────────────────────┘
              Zero data leaves your machine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three different interfaces. One local model. Everything private.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Windows — What Are They and Why Do They Matter?
&lt;/h2&gt;

&lt;p&gt;One of the most important concepts I clarified was the &lt;strong&gt;context window&lt;/strong&gt; — the model's working memory. It's the maximum amount of text a model can "see" at once in a conversation. Exceed it and it starts forgetting the beginning.&lt;/p&gt;

&lt;p&gt;Here's the reality check comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude Sonnet 4.5&lt;/th&gt;
&lt;th&gt;Gemma4 (local)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;200,000 tokens&lt;/td&gt;
&lt;td&gt;~8,000–32,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Approximate words&lt;/td&gt;
&lt;td&gt;~150,000 words&lt;/td&gt;
&lt;td&gt;~6,000–24,000 words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 years of tax docs&lt;/td&gt;
&lt;td&gt;Handles comfortably&lt;/td&gt;
&lt;td&gt;Would overflow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your VRAM directly affects how large a context window your local model can hold: the context lives in a key-value cache that sits alongside the model weights, so more VRAM means room for both the model and a bigger working memory.&lt;/p&gt;

&lt;p&gt;You can manually increase it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4 &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For single documents, images, or focused coding tasks — perfectly fine. For analyzing six years of tax filings all at once? That's where Claude's 200k context is a genuine advantage local models can't match yet.&lt;/p&gt;
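&lt;p&gt;A quick way to size a document against a context window is the common "about four characters per token" rule of thumb for English text. Real tokenizers differ, so this is estimation only, but it's enough to know whether a document will overflow:&lt;/p&gt;

```python
def rough_tokens(text):
    """Back-of-envelope token count: roughly 4 characters per token for English."""
    return len(text) // 4

def fits_in_context(text, ctx_tokens=32768, reserve=1024):
    """Leave part of the window (reserve) free for the model's own reply."""
    return ctx_tokens - reserve >= rough_tokens(text)

page = "word " * 500   # about one dense page of text
doc = page * 40        # a 40-page document
print(rough_tokens(doc), fits_in_context(doc))  # 25000 True
```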




&lt;h2&gt;
  
  
  Can Local Models Search the Internet?
&lt;/h2&gt;

&lt;p&gt;Short answer: &lt;strong&gt;No, not by default.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Local models are frozen at their training date. They have no internet connection during your conversation. This was an important distinction to understand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude (this chat)  →  Has web search tool  →  Knows current events ✅
Gemma4 (local)      →  No internet          →  Knowledge frozen at training ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This raised an interesting follow-up question though. When I used Gemini to analyze my tax filing and it spotted mistakes — was it searching the internet to find them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; And this was a real misconception I had.&lt;/p&gt;

&lt;p&gt;Gemini found tax errors because tax law, IRS rules, and common filing mistakes were baked into the model during training. It learned from millions of tax documents, accounting textbooks, and IRS publications. During your session it's not googling anything — it's applying trained knowledge to your specific document.&lt;/p&gt;

&lt;p&gt;Think of it like a tax accountant. They studied tax law for years. When reviewing your return they're not searching Google — they're applying what they already know to what you show them.&lt;/p&gt;

&lt;p&gt;Local models work the same way. The difference is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini/Claude&lt;/strong&gt;: More recent training data, larger knowledge base, up-to-date tax law changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma4 local&lt;/strong&gt;: Good foundational knowledge, may be slightly behind on very recent rule changes, but &lt;strong&gt;your documents never leave your machine&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For sensitive financial documents, that privacy trade-off is significant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting Claude Code to Gemma4
&lt;/h2&gt;

&lt;p&gt;This was surprisingly simple. Claude Code reads three environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or using Ollama's built-in launcher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama launch claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Claude Code started up I saw this at the bottom of the welcome screen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemma4 · API Usage Billing · pranayraavi@gmail.com's Organization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That confirms it's using Gemma4 through Ollama. No Anthropic billing. No subscription.&lt;/p&gt;

&lt;p&gt;What you get with this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ File reading and editing across your project&lt;/li&gt;
&lt;li&gt;✅ Terminal command execution&lt;/li&gt;
&lt;li&gt;✅ Multi-step agentic coding tasks&lt;/li&gt;
&lt;li&gt;✅ Git operations&lt;/li&gt;
&lt;li&gt;✅ MCP connectors and plugins&lt;/li&gt;
&lt;li&gt;✅ Project context awareness&lt;/li&gt;
&lt;li&gt;⚠️ Intelligence capped at Gemma4's capability (weaker than Claude Sonnet/Opus)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Python API Test
&lt;/h2&gt;

&lt;p&gt;Before setting up a GUI I wanted to confirm the raw API worked. Here's the script I wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a hello world in ascii diagram of moon and earth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          (           )
         /              \
  ----(---O---)    (------)  &amp;lt;-- Orbit Path
 /  /   \    /  /   \
|   |     | | |     |   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemma4, running entirely on my machine, responding to a Python script. No API key. No internet. Completely local. This was the moment it really clicked.&lt;/p&gt;
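&lt;p&gt;One refinement worth knowing: with &lt;code&gt;"stream": True&lt;/code&gt;, Ollama sends the reply as newline-delimited JSON fragments, so you can print tokens as they arrive instead of waiting for the full answer. Here's a standard-library sketch of the same chat helper; the endpoint and field names are Ollama's documented API, and the model name is whatever you pulled:&lt;/p&gt;

```python
import json
import urllib.request

def stream_chat(prompt, model="gemma4", base="http://localhost:11434"):
    """Yield reply fragments as Ollama generates them.
    With streaming on, the server sends one JSON object per line."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        for line in r:
            chunk = json.loads(line)
            yield chunk.get("response", "")
            if chunk.get("done"):
                break

def assemble(pieces):
    """Join streamed fragments into the final answer."""
    return "".join(pieces)

# Usage: for piece in stream_chat("hello"): print(piece, end="", flush=True)
print(assemble(["Hel", "lo, ", "world"]))  # Hello, world
```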




&lt;h2&gt;
  
  
  Setting Up Open WebUI — The ChatGPT-Like Interface
&lt;/h2&gt;

&lt;p&gt;For a proper GUI I went with &lt;strong&gt;Open WebUI&lt;/strong&gt; — a beautiful, feature-rich interface that runs locally and connects to Ollama.&lt;/p&gt;

&lt;p&gt;First attempt using pip failed because I had Python 3.13 and Open WebUI requires Python 3.11 or 3.12:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;ERROR: Could not find a version that satisfies the requirement open-webui
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I went the Docker route instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Docker Desktop
&lt;/h3&gt;

&lt;p&gt;Docker Desktop is free for personal use. Download from &lt;code&gt;docker.com/products/docker-desktop&lt;/code&gt;. During install, WSL 2 backend gets configured automatically on Windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Open WebUI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;127.0.0.1:3000:8080&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui:/app/backend/data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--add-host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host.docker.internal:host-gateway&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;ghcr.io/open-webui/open-webui:main&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I initially tried &lt;code&gt;-p 3000:80&lt;/code&gt;, which failed on two counts: Open WebUI listens on port 8080 &lt;em&gt;inside&lt;/em&gt; the container, and another process was already using port 3000 on my machine. Switching to &lt;code&gt;-p 127.0.0.1:3000:8080&lt;/code&gt; fixed it.&lt;/p&gt;
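&lt;p&gt;Before picking a host port for the &lt;code&gt;-p&lt;/code&gt; mapping, a two-line loopback probe can tell you whether anything is already listening there. A convenience sketch:&lt;/p&gt;

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0  # 0 means the connect succeeded

# Usage: if port_in_use(3000), pick another host port for the mapping.
```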

&lt;p&gt;Confirmed it was running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;netstat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ano&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;findstr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;3000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c"&gt;# TCP  0.0.0.0:3000  LISTENING  ← Docker up and running&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;curl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c"&gt;# StatusCode: 200 OK  ← Server responding&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then opened &lt;code&gt;http://localhost:3000&lt;/code&gt; in Chrome and saw the Open WebUI interface with Gemma4 auto-detected.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Real Test — Image to Text Extraction
&lt;/h3&gt;

&lt;p&gt;One of the reasons I picked Gemma4 over Qwen3.6 was its multimodal capability — it can actually &lt;em&gt;see&lt;/em&gt; images. I put this to the test immediately.&lt;/p&gt;

&lt;p&gt;I had a photo of handwritten chess notes and uploaded it directly into the Open WebUI chat. The prompt was simple: &lt;em&gt;"convert this image to text"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Gemma4 thought for 11 seconds and returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FORK/DOUBLE ATTACK

When we attack two or more pieces at the same time then it is known
as fork or double attack

Note- Knights are good at making fork.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a perfect transcription of handwritten text — extracted entirely locally, no cloud OCR service, no API key, nothing leaving my machine. It even generated a relevant follow-up suggestion: &lt;em&gt;"Are there other kinds of tactical attacks besides forks, like pins or skewers?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the multimodal capability in action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Handwritten text extracted accurately&lt;/li&gt;
&lt;li&gt;✅ Context understood (chess notes)&lt;/li&gt;
&lt;li&gt;✅ Intelligent follow-up suggested&lt;/li&gt;
&lt;li&gt;✅ 100% local — image never left my PC&lt;/li&gt;
&lt;li&gt;✅ Free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For anyone with scanned documents, handwritten notes, receipts, or any image containing text — this works out of the box with Gemma4 in Open WebUI.&lt;/p&gt;
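&lt;p&gt;You can script the same thing, too: Ollama's &lt;code&gt;/api/generate&lt;/code&gt; accepts an &lt;code&gt;images&lt;/code&gt; array of base64-encoded strings alongside the prompt. Here's a sketch of building that request body; the filename in the comment is hypothetical:&lt;/p&gt;

```python
import base64

def image_payload(image_bytes, prompt="convert this image to text", model="gemma4"):
    """Build the JSON body for a multimodal Ollama request:
    images ride along as base64-encoded strings next to the prompt."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

# Real call (hypothetical filename), using requests as in the chat script:
#   with open("chess_notes.jpg", "rb") as f:
#       requests.post("http://localhost:11434/api/generate", json=image_payload(f.read()))
print(image_payload(b"fake image bytes")["prompt"])  # convert this image to text
```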




&lt;h2&gt;
  
  
  Document Upload and RAG — How It Actually Works
&lt;/h2&gt;

&lt;p&gt;One of the most powerful features of Open WebUI is document upload with &lt;strong&gt;RAG (Retrieval Augmented Generation)&lt;/strong&gt;. This is how you can upload your AWS docs, tax returns, or any PDFs and chat with them.&lt;/p&gt;

&lt;p&gt;Here's what happens under the hood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You upload PDF
      ↓
Open WebUI splits it into chunks
      ↓
Converts chunks to embeddings (mathematical vectors)
      ↓
Stores in ChromaDB (local vector database)
      ↓
You ask a question
      ↓
ChromaDB finds the most relevant chunks
      ↓
Sends chunks to Gemma4 as context
      ↓
Gemma4 answers based on YOUR document
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is stored locally at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Users\lavan\AppData\Roaming\open-webui\data\
  📁 vector_db    ← document embeddings (ChromaDB)
  📁 uploads      ← original files
  📄 webui.db     ← chat history (SQLite)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your documents never leave your machine. ChromaDB is completely free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One important limitation&lt;/strong&gt;: RAG finds &lt;em&gt;relevant chunks&lt;/em&gt;, not the entire document. If an answer spans many sections of a large document, it might miss some context. The workaround is to upload smaller, focused documents rather than one giant PDF.&lt;/p&gt;
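&lt;p&gt;To demystify the retrieval step, here's a toy version of chunking and similarity search in plain Python. The chunk sizes are arbitrary and the vectors are stand-ins (in the real stack they come from an embedding model, with ChromaDB doing the lookup), but the idea is the same:&lt;/p&gt;

```python
import math

def chunk(text, size=300, overlap=50):
    """Split a document into overlapping chunks, the first step of the pipeline."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a, b):
    """Similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(question_vec, chunk_vecs, k=3):
    """Indices of the k chunks closest to the question embedding.
    In Open WebUI this lookup happens inside ChromaDB; the idea is the same."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(question_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]

# Stand-in vectors; real ones would come from an embedding model.
print(top_k([1, 0], [[0, 1], [1, 0.1], [0.9, 0]], k=2))  # [2, 1]
```

&lt;p&gt;Only the winning chunks get sent to the model, which is exactly why answers spanning many sections of a giant PDF can miss context.&lt;/p&gt;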




&lt;h2&gt;
  
  
  The Full Stack — What I Now Have Running
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Ollama          — model manager and local API server
✅ Gemma4          — the AI model (multimodal, ~12GB)
✅ Claude Code     — agentic coding with local model
✅ Open WebUI      — browser-based chat interface with document upload
✅ Python API      — scripts calling the model directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total monthly cost: &lt;strong&gt;$0&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;After going through all of this, here's the practical split I settled on:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coding with file editing&lt;/td&gt;
&lt;td&gt;Claude Code + Gemma4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image analysis / image to text&lt;/td&gt;
&lt;td&gt;Open WebUI + Gemma4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Q&amp;amp;A (private)&lt;/td&gt;
&lt;td&gt;Open WebUI + RAG + Gemma4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web research / current events&lt;/td&gt;
&lt;td&gt;Claude.ai or Perplexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex reasoning / large context&lt;/td&gt;
&lt;td&gt;Claude.ai (paid)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tax doc analysis (all years)&lt;/td&gt;
&lt;td&gt;Claude.ai or NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick Python scripts calling AI&lt;/td&gt;
&lt;td&gt;Direct Ollama API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Honest Reflections
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What surprised me&lt;/strong&gt;: How straightforward the setup actually was once I understood the mental model. Ollama is the server, the model is the brain, everything else just connects to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I underestimated&lt;/strong&gt;: The quality gap between local models and Claude Sonnet/Opus is real. For simple tasks Gemma4 is impressive. For complex multi-step reasoning, Claude's frontier models are noticeably stronger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd tell myself at the start&lt;/strong&gt;: Local AI is not a replacement for cloud AI — it's a complement. Use local for private, repetitive, or experimental tasks. Use cloud AI for research, complex reasoning, and anything that benefits from a larger context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The privacy win is real&lt;/strong&gt;: For sensitive documents — financial records, personal data, proprietary code — local AI is genuinely better from a privacy standpoint. Your data does not leave your machine. Full stop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ollama: &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;ollama.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Open WebUI: &lt;a href="https://openwebui.com" rel="noopener noreferrer"&gt;openwebui.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Code: &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;claude.ai/code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ollama + Claude Code docs: &lt;a href="https://docs.ollama.com/integrations/claude-code" rel="noopener noreferrer"&gt;docs.ollama.com/integrations/claude-code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docker Desktop (free): &lt;a href="https://docker.com/products/docker-desktop" rel="noopener noreferrer"&gt;docker.com/products/docker-desktop&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;All of this runs on a Windows machine with 32GB RAM, an NVIDIA GPU with ~11GB VRAM, and a Core i9 processor. If you have similar hardware you can replicate this entire stack in an afternoon.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
