This article was originally published on runaihome.com
Most local-AI tutorials assume you already use a terminal, write Python, or are comfortable with Docker. That assumption excludes 90% of the people who actually want to run models locally — the content creators worried about uploading drafts to OpenAI, the designers who want a private idea-bouncing partner, the students who need to summarize 200-page PDFs without sending them to a third party.
This guide is for those people. We'll get Ollama running on Windows, then switch entirely to GUI tools so you never need the terminal again after the first 30 seconds. No Python, no Docker, no environment variables.
The 30-second terminal exception
You will open the terminal exactly once, during installation. After that, three different GUI tools take over.
- Download Ollama from ollama.com/download/windows. The installer is named `OllamaSetup.exe`. Double-click and install. There are no advanced settings to configure, and Ollama bundles the CUDA libraries it needs, so you do not need to install the CUDA toolkit separately.
- After install, you'll see a small llama icon in the system tray (bottom-right corner of the Windows taskbar). If you don't see it, search "Ollama" in the Start menu and launch it once.
- To verify it works: press Win+R, type `cmd`, press Enter. In the black window, type `ollama run gemma3:1b` and hit Enter. You'll see a download progress bar, then a `>>>` prompt. Type `hello`, press Enter. The model responds. Type `/bye` to exit.
That's the last time you need the terminal. Close it.
Three GUI options, ranked by friction
LM Studio — recommended for non-programmers
lmstudio.ai gives you a complete graphical workflow: search models on the left panel, download in the middle, chat on the right. It does not require any Ollama setup at all — LM Studio has its own model registry. This means models you download in LM Studio are stored separately from Ollama's models, so avoid running both unless you have the disk space to spare.
For most users, LM Studio is the answer. Download. Install. Click the search icon, type "qwen3" or "gemma3", and look for the green check mark next to each variant. Green means your hardware can run it; red means you don't have enough VRAM. Click "Download" on a green variant. When it finishes, click the chat icon, select the model, and start typing.
One useful LM Studio feature non-programmers often miss: "GPU Offload Layers" in the Hardware Settings panel. Setting this to 999 tells LM Studio to push as many model layers to the GPU as will fit. LM Studio's own documentation notes that exceeding VRAM causes it to spill layers into system RAM, which can be up to 30× slower. If responses feel painfully slow, check that setting before assuming your GPU isn't capable.
Open WebUI Desktop 0.9.0 — for ChatGPT-like UI
Open WebUI is known for its server version that requires Docker. Most non-programmers should not touch that. But Open WebUI 0.9.0, released in April 2026, ships a standalone Windows desktop app with no Docker required and zero telemetry.
The interface is the closest local equivalent to ChatGPT's web UI: you can upload PDFs and have the model read them, save conversation histories, switch between models on the fly, and access a floating chat bar anywhere on your screen with Shift+Ctrl+I. The downside is it takes around 30 seconds to launch each time, as it spins up a local server in the background.
Download the EXE from the Open WebUI desktop releases, install it, and connect it to your running Ollama instance. Open WebUI finds Ollama automatically at localhost:11434 — no configuration needed if Ollama is already running.
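If Open WebUI or Page Assist cannot find Ollama, one way to check whether anything is answering at localhost:11434 is to ask Ollama's local HTTP API for its model list. The sketch below is optional and assumes Python 3 is installed, which the rest of this guide does not require. The function names are illustrative; the only Ollama-specific piece is the /api/tags endpoint, which lists locally installed models.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned above


def parse_model_names(payload: str) -> list[str]:
    """Extract model names from the JSON text returned by /api/tags."""
    data = json.loads(payload)
    return [model["name"] for model in data.get("models", [])]


def list_ollama_models(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return installed model names, or None if Ollama is not reachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except OSError:
        # Covers connection refused and timeouts (URLError is an OSError).
        return None


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Nothing is answering on localhost:11434. Is the Ollama tray icon running?")
    else:
        print("Ollama is running. Installed models:", models or "none yet")
```

If the script reports that nothing is answering, launching Ollama from the Start menu and running it again is the first thing to try.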
Page Assist — for browser-only workflows
If your computer is short on disk space and you don't want another desktop app installed, Page Assist is a Chrome/Edge extension (~1MB) that runs in a sidebar and connects to your existing Ollama installation. The UI is more basic than LM Studio, but the friction is lowest of the three — you stay in the browser, and there is nothing to install or update outside of the extension itself.
Picking a model that fits your hardware
The number-one question from beginners: "what model can my computer actually run?" The answer depends on a single number: your GPU's VRAM (or system RAM if you have no GPU). At Q4 quantization — the compressed format Ollama uses by default — a model needs roughly 0.6–0.7GB per billion parameters, plus about 1–2GB of overhead. A 7B model therefore needs around 5–6GB; a 14B model needs around 9–10GB.
| Hardware | VRAM / RAM available | Comfortable size | Recommended model |
|---|---|---|---|
| Integrated GPU or no dedicated GPU | 8–16GB system RAM | 1B | Gemma 3 1B |
| GTX 1060 / RTX 2060 | 6GB VRAM | 3–4B | Qwen 3 4B |
| RTX 3060 12GB | 12GB VRAM | 7–8B | Llama 3.1 8B |
| RTX 4060 Ti 16GB | 16GB VRAM | 13–14B | Qwen 3 14B |
| RTX 4080 / 4090 | 16–24GB VRAM | 30B+ | Qwen 3 32B |
If your card's VRAM is exceeded, Ollama will offload some layers to system RAM and keep running — but speed will drop dramatically. The green/red indicator in LM Studio reflects this boundary precisely, which is one reason it's the recommended starting point. For a deeper explanation of how quantization, context length, and VRAM interact, see our GPU buying guide for local AI.
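The rule of thumb in this section (0.6–0.7GB per billion parameters at Q4, plus 1–2GB of overhead) can be written as a few lines of arithmetic. The sketch below is a rough estimator only, and the function names and the "comfortable / tight / too large" labels are illustrative. It combines the low ends and the high ends of both figures, so its range is a little wider than the rounded numbers quoted above, and it ignores context length entirely.

```python
def estimate_q4_memory_gb(params_billion: float) -> tuple[float, float]:
    """Return a (low, high) memory estimate in GB for a Q4 model.

    Rule of thumb from this guide: 0.6-0.7 GB per billion parameters,
    plus 1-2 GB of overhead.
    """
    low = params_billion * 0.6 + 1.0
    high = params_billion * 0.7 + 2.0
    return round(low, 1), round(high, 1)


def fits(params_billion: float, available_gb: float) -> str:
    """Classify a model size against the VRAM (or RAM) you have available."""
    low, high = estimate_q4_memory_gb(params_billion)
    if high <= available_gb:
        return "comfortable"
    if low <= available_gb:
        return "tight"
    return "too large"


if __name__ == "__main__":
    for size in (1, 4, 8, 14):
        low, high = estimate_q4_memory_gb(size)
        print(f"{size}B: roughly {low}-{high} GB, on a 12GB card: {fits(size, 12)}")
```

A "tight" result means the model may load but leaves little headroom, which is where layers are likely to start spilling into system RAM.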
Picking by task, not just hardware
Once you know what fits, pick by use case:
- English writing / editing: Llama 3.2 or Qwen 3. Both handle nuanced rewrites and tone adjustments well.
- Code review or explanation (even if you don't code yourself): Qwen 2.5 Coder. Trained specifically on code; much better than general models at explaining what a snippet does in plain English.
- PDF summarization: Gemma 3 4B with Open WebUI Desktop. Gemma 3 4B, 12B, and 27B support a 128k context window (the 1B model uses 32k), which handles most PDF-length documents; Open WebUI handles the upload.
- Image description / alt text: Llama 3.2 Vision or LLaVA. These are multimodal — they accept images as input alongside text.
- Casual conversation / roleplay: MythoMax or Dolphin-Mistral. Community-tuned for natural dialog rather than instruction-following.
- Chinese or bilingual text: Qwen 3 family. Alibaba's training emphasis shows in tone and vocabulary for Mandarin-heavy workloads.
Managing your models over time
Ollama stores downloaded models at C:\Users\<your-username>\.ollama\models. After you download a few, this folder grows fast: Qwen 3 14B at Q4 is around 8.2GB, and Llama 3.1 8B is around 5GB. A few habits that prevent disk sprawl:
- Remove models you don't use: open a terminal once and run `ollama list` to see everything installed, then `ollama rm model-name` to delete one. You can always re-download later.
- LM Studio stores models separately: if you're running both Ollama and LM Studio, models are not shared between them. Check C:\Users\<username>\.lmstudio\models if you want to see what LM Studio has stored.
- Model updates are manual: Ollama doesn't auto-update downloaded models. To get a newer version, run `ollama pull model-name`; it checks whether the latest version differs from what you have and only downloads the changed parts.
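To see how much disk space the two model folders are using, you can add up the file sizes under each one. This is an optional sketch that assumes Python 3 is installed and that both tools use the default locations named above; the function name is illustrative, and a folder that does not exist is simply reported as 0.

```python
from pathlib import Path


def folder_size_gb(folder: Path) -> float:
    """Total size of all files under `folder` in GB (1 GB = 1024**3 bytes here)."""
    if not folder.is_dir():
        return 0.0
    total_bytes = sum(f.stat().st_size for f in folder.rglob("*") if f.is_file())
    return total_bytes / 1024**3


if __name__ == "__main__":
    # Default locations named in this guide; adjust if you moved them.
    locations = [
        ("Ollama", Path.home() / ".ollama" / "models"),
        ("LM Studio", Path.home() / ".lmstudio" / "models"),
    ]
    for label, path in locations:
        print(f"{label}: {folder_size_gb(path):.1f} GB in {path}")
```

The script only reads file sizes; removing a model is still done with ollama rm or from LM Studio's own interface.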
Ten common errors and what to do about them
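"CUDA error" or "out of memory": the model is larger than your VRAM. Switch to a smaller variant (qwen3:4b instead of qwen3:14b), or pull a tag that names a Q4 quantization explicitly. In the Ollama library the quantization is normally part of the tag itself (a suffix such as q4_K_M), so check the model's page on ollama.com for the exact tag names available.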