Jovan Chan

Posted on Jun 1 • Originally published at runaihome.com

Ollama for Non-Programmers: Run Local AI on Windows Without Code (2026)

#ollama #beginners #windows #gui


Most local-AI tutorials assume you already use a terminal, write Python, or are comfortable with Docker. That assumption excludes most of the people who actually want to run models locally: content creators worried about uploading drafts to OpenAI, designers who want a private idea-bouncing partner, students who need to summarize 200-page PDFs without sending them to a third party.

This guide is for those people. We'll get Ollama running on Windows, then switch entirely to GUI tools, so after the first 30 seconds you won't need the terminal for everyday use. No Python, no Docker, no environment variables.

The 30-second terminal exception

You will open the terminal exactly once, during installation. After that, three different GUI tools take over.

Download Ollama from ollama.com/download/windows. The installer is named OllamaSetup.exe. Double-click it and install; there are no advanced settings to configure. Ollama bundles the CUDA libraries it needs, so you do not need to install the CUDA toolkit separately.
After install, you'll see a small llama icon in the system tray (bottom-right corner of the Windows taskbar). If you don't see it, search "Ollama" in the Start menu and launch it once.
To verify it works, press Win+R, type cmd, and press Enter. In the black window, type ollama run gemma3:1b and press Enter. You'll see a download progress bar, then a >>> prompt. Type hello and press Enter; the model responds. Type /bye to exit.

That's the last time you need the terminal for everyday use. Close it.

Three GUI options, ranked by friction

LM Studio — recommended for non-programmers

lmstudio.ai gives you a complete graphical workflow: search models on the left panel, download in the middle, chat on the right. It does not require any Ollama setup at all, because LM Studio has its own model registry. The catch is that models you download in LM Studio are stored separately from Ollama's, so avoid keeping the same models in both unless you have disk space to spare.

For most users, LM Studio is the answer. Download. Install. Click the search icon, type "qwen3" or "gemma3", and look for the green check mark next to each variant. Green means your hardware can run it; red means you don't have enough VRAM. Click "Download" on a green variant. When it finishes, click the chat icon, select the model, and start typing.

One useful LM Studio feature non-programmers often miss is "GPU Offload Layers" in the Hardware Settings panel. Setting it to 999 tells LM Studio to push as many model layers to the GPU as will fit. LM Studio's own documentation notes that exceeding VRAM causes it to spill layers into system RAM, which can be up to 30× slower. If responses feel painfully slow, check that setting before assuming your GPU isn't capable: a low value leaves layers on the CPU even when the GPU has room for them.
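Why does a value like 999 mean "as many as possible"? Here is a simplified model of the behavior described above: the requested count is clamped to the model's real layer count and to whatever fits in free VRAM, and the remaining layers stay in system RAM. This Python sketch assumes every layer is the same size and ignores memory reserved for the conversation context; the function name and the numbers are hypothetical, and it is not LM Studio's actual code.

```python
import math


def layers_on_gpu(requested: int, total_layers: int,
                  model_gb: float, free_vram_gb: float) -> int:
    """Simplified estimate of how many layers end up on the GPU.

    Assumes every layer is the same size and ignores the extra VRAM the
    runtime reserves for context; real loaders are more careful.
    """
    if total_layers <= 0 or model_gb <= 0:
        raise ValueError("total_layers and model_gb must be positive")
    # How many equal-sized layers fit in the free VRAM
    # (multiply before dividing to avoid floating-point surprises).
    fit = math.floor(free_vram_gb * total_layers / model_gb)
    # A huge request like 999 is clamped to the real layer count and to what fits.
    return max(0, min(requested, total_layers, fit))


if __name__ == "__main__":
    # Hypothetical 8GB model with 40 layers on a card with 6GB free.
    total = 40
    on_gpu = layers_on_gpu(999, total, model_gb=8.0, free_vram_gb=6.0)
    print(f"{on_gpu} of {total} layers on the GPU, {total - on_gpu} left in system RAM")
```

With these made-up numbers it reports 30 of 40 layers on the GPU. Those 10 leftover layers run from system RAM, which is where the slowdown comes from.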

Open WebUI Desktop 0.9.0 — for ChatGPT-like UI

Open WebUI is best known as a server application that is usually installed with Docker. Most non-programmers should not touch that. But Open WebUI 0.9.0, released in April 2026, ships a standalone Windows desktop app with no Docker required and zero telemetry.

The interface is the closest local equivalent to ChatGPT's web UI: you can upload PDFs and have the model read them, save conversation histories, switch between models on the fly, and open a floating chat bar anywhere on your screen with Shift+Ctrl+I. The downside is that it takes around 30 seconds to launch each time, because it spins up a local server in the background.

Download the EXE from the Open WebUI desktop releases and install it. As long as Ollama is running, Open WebUI finds it automatically at localhost:11434 (Ollama's default address), so there is nothing to configure.
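Connecting to Ollama comes down to sending HTTP requests to that address. If you ever want to confirm Ollama is up without opening a chat app, here is a minimal Python sketch. It assumes Ollama is on its default port 11434 and relies on Ollama answering a plain request to its root URL with the text "Ollama is running". The function name is our own, and this is not Open WebUI's actual code.

```python
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def ollama_is_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Ollama is running" in body
    except OSError:
        # Refused connections, timeouts and HTTP errors all raise OSError subclasses.
        return False


if __name__ == "__main__":
    if ollama_is_running():
        print("Ollama is up at", OLLAMA_URL)
    else:
        print("Ollama isn't answering; check for the llama icon in the system tray.")
```

If it reports that Ollama isn't answering, launch Ollama from the Start menu and try again. Any GUI that points at localhost:11434 would fail to connect in the same situation.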

Page Assist — for browser-only workflows

If your computer is short on disk space and you don't want another desktop app installed, Page Assist is a Chrome/Edge extension (~1MB) that runs in a sidebar and connects to your existing Ollama installation. Its UI is more basic than LM Studio's, but its friction is the lowest of the three: you stay in the browser, and there is nothing to install or update outside the extension itself.

Picking a model that fits your hardware

The number-one question from beginners: "what model can my computer actually run?" The answer depends on a single number: your GPU's VRAM (or system RAM if you have no GPU). At Q4 quantization (the compressed format Ollama uses by default), a model needs roughly 0.6–0.7GB per billion parameters, plus about 1–2GB of overhead. A 7B model therefore needs roughly 5–7GB, and a 14B model roughly 9–12GB.

Hardware	VRAM / RAM available	Comfortable size (Q4)	Recommended model
Integrated GPU or no dedicated GPU	8–16GB system RAM	1B	Gemma 3 1B
GTX 1060 / RTX 2060	6GB VRAM	3–4B	Qwen 3 4B
RTX 3060 12GB	12GB VRAM	7–8B	Llama 3.1 8B
RTX 4060 Ti 16GB / RTX 4080	16GB VRAM	13–14B	Qwen 3 14B
RTX 4090	24GB VRAM	30–32B	Qwen 3 32B
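The rule of thumb behind this table is easy to turn into a quick calculator. Here is a minimal Python sketch. The 0.6–0.7GB-per-billion and 1–2GB-overhead figures come straight from the text above; they are approximations, not measurements, and the function name is our own.

```python
def estimate_vram_gb(params_billion: float) -> tuple[float, float]:
    """Rough (low, high) VRAM estimate in GB for a Q4-quantized model.

    Rule of thumb: 0.6-0.7 GB per billion parameters plus 1-2 GB overhead.
    Real usage also grows with the context length you configure.
    """
    low = 0.6 * params_billion + 1.0
    high = 0.7 * params_billion + 2.0
    return round(low, 1), round(high, 1)


if __name__ == "__main__":
    for size in (1, 4, 7, 14, 32):
        low, high = estimate_vram_gb(size)
        print(f"{size:>2}B model: about {low} to {high} GB")
```

For a 7B model this gives 5.2 to 6.9GB, and for a 14B model 9.4 to 11.8GB. Comparing the high end against your card's VRAM is the cautious choice, because it leaves room for longer conversations.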

If your card's VRAM is exceeded, Ollama offloads some layers to system RAM and keeps running, but speed drops dramatically. The green/red indicator in LM Studio tracks this boundary closely, which is one reason it's the recommended starting point. For a deeper explanation of how quantization, context length, and VRAM interact, see our GPU buying guide for local AI.

Picking by task, not just hardware

Once you know what fits, pick by use case:

English writing / editing: Llama 3.2 or Qwen 3. Both handle nuanced rewrites and tone adjustments well.
Code review or explanation (even if you don't code yourself): Qwen 2.5 Coder. Trained specifically on code; much better than general models at explaining what a snippet does in plain English.
PDF summarization: Gemma 3 4B with Open WebUI Desktop. Gemma 3 4B, 12B, and 27B support a 128k context window (the 1B model uses 32k), which handles most PDF-length documents; Open WebUI handles the upload.
Image description / alt text: Llama 3.2 Vision or LLaVA. These are multimodal — they accept images as input alongside text.
Casual conversation / roleplay: MythoMax or Dolphin-Mistral. Community-tuned for natural dialog rather than instruction-following.
Chinese or bilingual text: Qwen 3 family. Alibaba's training emphasis shows in tone and vocabulary for Mandarin-heavy workloads.
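To judge whether a document is likely to fit a given context window (for example Gemma 3 4B's 128k versus Gemma 3 1B's 32k, per the list above), a common rule of thumb for English text is about four characters per token. The Python sketch below applies that heuristic. The real count depends on each model's tokenizer, the function names are hypothetical, and the 1,024 tokens held back for the reply are an arbitrary assumption.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count using the ~4-characters-per-token rule of thumb for English."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_tokens: int, reply_reserve: int = 1024) -> bool:
    """True if the text plus room for the model's reply should fit the context window.

    reply_reserve is an arbitrary assumption; raise it if you expect long answers.
    """
    return estimate_tokens(text) + reply_reserve <= context_tokens


if __name__ == "__main__":
    # Stand-in for text copied out of a PDF: 250,000 characters.
    sample = "word " * 50_000
    print("estimated tokens:", estimate_tokens(sample))
    print("fits a ~128k window:", fits_in_context(sample, 128_000))
    print("fits a ~32k window:", fits_in_context(sample, 32_000))
```

The 250,000-character sample works out to about 62,500 tokens. That comfortably fits a 128k window but not a 32k one, which is why the larger Gemma 3 models are the better pick for long PDFs.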

Managing your models over time

Ollama stores downloaded models at C:\Users\<your-username>\.ollama\models. After you download a few, this folder grows fast: Qwen 3 14B at Q4 is around 8.2GB, and Llama 3.1 8B is around 5GB. A few habits prevent disk sprawl:

Remove models you don't use: this is one of the few chores that still needs the terminal. Open it, run ollama list to see everything installed, then run ollama rm model-name to delete one. You can always re-download it later.

LM Studio stores models separately: if you're running both Ollama and LM Studio, models are not shared between them. Check C:\Users\<your-username>\.lmstudio\models to see what LM Studio has stored.

Model updates are manual: Ollama doesn't auto-update downloaded models. To get a newer version, run ollama pull model-name — it checks whether the latest version differs from what you have and only downloads the changed parts.
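Both tools store models as ordinary files, so checking how much space they take comes down to adding up file sizes. Here is a minimal Python sketch that assumes Ollama's default Windows location (Path.home() resolves to C:\Users\<your-username>). You could point the same function at LM Studio's folder instead. The function name is our own.

```python
from pathlib import Path


def folder_size_gib(folder: Path) -> float:
    """Total size of every file under `folder`, in GiB (what Windows Explorer labels "GB")."""
    if not folder.is_dir():
        return 0.0
    total_bytes = sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())
    return total_bytes / 1024**3


if __name__ == "__main__":
    # Ollama's default model folder: C:\Users\<your-username>\.ollama\models on Windows
    models_dir = Path.home() / ".ollama" / "models"
    print(f"{models_dir}: {folder_size_gib(models_dir):.1f} GB")
```

Running it before and after an ollama rm is a quick way to confirm the space was actually freed.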

Ten common errors and what to do about them

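"CUDA error" or "out of memory": the model is larger than your VRAM. Switch to a smaller variant (qwen3:4b instead of qwen3:14b). Requesting Q4 explicitly rarely helps, since Ollama's default tags are already Q4. If you do want a specific quantization, pick it from the Tags list on the model's ollama.com page (tags take a form like qwen3:14b-q4_K_M) rather than appending :q4_K_M to a name that already has a tag.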
