Run Gemma 3 Locally on Windows: The VRAM Guide Nobody Gave You [2026]

Last month I burned an entire Saturday trying to get Gemma 3 running on my Windows workstation. What should have been a 15-minute setup turned into four hours of driver mismatches, VRAM errors, and cryptic Ollama crashes. If you want to run Gemma 3 locally on a Windows PC with an NVIDIA GPU, this guide exists so you don't repeat my mistakes.

Google's Gemma 3 family is one of the most capable open-weight model families available right now. It ships in four sizes — 1B, 4B, 12B, and 27B parameters — and the smaller variants run surprisingly well on consumer hardware. But "runs on consumer hardware" and "runs well on your consumer hardware" are very different claims. The thing standing between them is almost always VRAM.

What Is Gemma 3 and Why Run It Locally?

Gemma 3 is Google's open-weight model family, built from the same research behind the Gemini models. As Tris Warkentin, Director of Product Management at Google, explained during the original Gemma announcement, these models are designed to be accessible while maintaining responsible AI principles. The Gemma 3 generation expanded the lineup significantly. The larger variants got multimodal capabilities (vision plus text), and the performance ceiling moved well beyond what the original Gemma 2B and 7B could touch.

Why run it locally instead of hitting an API? Three reasons that matter if you're a working engineer:

  • Privacy. Your prompts, your data, your code never leave your machine. If you're working with proprietary codebases or sensitive documents, this isn't a nice-to-have. It's a hard requirement.
  • Latency. No network round-trip. For interactive use cases like coding assistants or local RAG pipelines, you feel the difference immediately.
  • Cost. After the upfront GPU investment, inference is free. I've been running local models for months and the electricity cost is a rounding error compared to API bills.

I've written before about running local LLMs and the hardware you actually need. Gemma 3 follows the same principles, but Windows has some specific quirks worth calling out.

How Much VRAM Does Gemma 3 Actually Need?

This is the question everyone asks first, and most guides hand-wave with "it depends." Here are the actual numbers. VRAM requirements come down to two things: model size and quantization level.

Full precision (FP16/BF16) means loading the model exactly as Google trained it. Quantization compresses the weights — 8-bit (Q8) roughly halves the memory, 4-bit (Q4) cuts it to about a quarter. You lose some quality with each step down, but for most practical tasks, Q4 gets you shockingly close to full precision output.

| Model | FP16 VRAM | Q8 VRAM | Q4 VRAM | Minimum GPU |
| --- | --- | --- | --- | --- |
| Gemma 3 1B | ~2.5 GB | ~1.5 GB | ~1 GB | Any modern GPU |
| Gemma 3 4B | ~9 GB | ~5 GB | ~3 GB | RTX 4060 (8GB) |
| Gemma 3 12B | ~25 GB | ~13 GB | ~8 GB | RTX 4070 Ti (12GB) |
| Gemma 3 27B | ~55 GB | ~28 GB | ~16 GB | RTX 4090 (24GB) |

Here's what this table actually tells you: most developers with a mid-range NVIDIA card can comfortably run Gemma 3 4B at full precision or Gemma 3 12B at Q4 quantization. The 27B model is where you need serious hardware. You're looking at an RTX 4090 or a professional card to run it without constant memory pressure.

One thing I learned the hard way: VRAM usage in practice is always higher than the model weight size alone. The inference runtime, KV cache, and whatever context you're holding all eat into available memory. Budget an extra 1-2 GB beyond what the weights require. I didn't the first time, and spent an hour staring at out-of-memory errors wondering what I was doing wrong.
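
If you want to sanity-check the table against your own card, the back-of-envelope math is: parameter count times bytes per weight, plus a buffer for the runtime and KV cache. A minimal PowerShell sketch; the per-weight byte counts and the 1.5 GB buffer are rough assumptions, not measurements:

```powershell
# Rough VRAM estimate: weights + runtime/KV-cache overhead.
# Bytes per weight: FP16 ~2.0, Q8 ~1.0, Q4 ~0.5 (approximate).
function Get-VramEstimateGB {
    param(
        [double]$ParamsBillions,     # e.g. 4 for Gemma 3 4B
        [double]$BytesPerWeight,     # 2.0 (FP16), 1.0 (Q8), 0.5 (Q4)
        [double]$OverheadGB = 1.5    # runtime + KV cache, assumed buffer
    )
    [math]::Round($ParamsBillions * $BytesPerWeight + $OverheadGB, 1)
}

Get-VramEstimateGB -ParamsBillions 12 -BytesPerWeight 0.5   # 12B at Q4 -> ~7.5 GB
Get-VramEstimateGB -ParamsBillions 27 -BytesPerWeight 0.5   # 27B at Q4 -> ~15 GB
```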

Setting Up Ollama on Windows With NVIDIA

Ollama is the simplest way to get Gemma 3 running locally on Windows. It bundles model management, the inference runtime, and CUDA GPU detection into a single installer. No conda environments, no pip dependency hell, no manual CUDA toolkit installation.

Here's the setup that actually works:

Step 1: Update your NVIDIA drivers. Everyone skips this and then spends two hours debugging. You need the latest Game Ready or Studio drivers — not whatever shipped with your card six months ago. Open GeForce Experience or go to nvidia.com/drivers and grab the current version. As the Datacamp Ollama tutorial notes, Ollama automatically detects and uses your GPU if the drivers are current. If they're not, it silently falls back to CPU inference. You'll wonder why everything is unbearably slow. Ask me how I know.
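
A quick sanity check before moving on: if this command fails or reports a driver from last year, fix that first, because everything downstream depends on it.

```powershell
# Confirm the driver is installed and CUDA can see the card.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
```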

Step 2: Install Ollama. Download the Windows installer from ollama.com. Standard .exe — run it, follow the prompts, done. Ollama runs as a background service and exposes a local API on port 11434.
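
Once the installer finishes, the service should already be answering on port 11434. You can confirm it from PowerShell using Ollama's standard REST endpoints:

```powershell
# The root endpoint returns a plain "Ollama is running" banner.
Invoke-RestMethod -Uri "http://localhost:11434/"

# /api/tags lists every model pulled so far (empty right after a fresh install).
(Invoke-RestMethod -Uri "http://localhost:11434/api/tags").models
```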

Step 3: Pull a Gemma 3 model. Open a terminal and run ollama pull gemma3:4b for the 4B parameter model. You can also pull gemma3:1b, gemma3:12b, or gemma3:27b depending on your VRAM budget. Ollama downloads quantized versions by default (typically Q4_K_M), which is the right call for most consumer GPUs.

Step 4: Run it. Type ollama run gemma3:4b and you're in an interactive chat. That's it. Seriously.
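
For reference, the whole workflow in one place, including a one-shot prompt if you want an answer without dropping into the interactive session:

```powershell
# Pull the 4B model (Ollama fetches a Q4_K_M quantized build by default).
ollama pull gemma3:4b

# Interactive chat. Type /bye to exit.
ollama run gemma3:4b

# One-shot prompt: prints the answer and returns you to the shell.
ollama run gemma3:4b "Explain the difference between a process and a thread in two sentences."
```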

If you've previously explored building a portable LLM on a USB stick, you'll recognize Ollama as the engine behind that setup. Same tool here, just with beefier models and GPU acceleration.

The Windows-Specific Pitfalls That Will Waste Your Time

This is where the "just install Ollama" advice falls apart. Windows introduces friction that Linux users never deal with. I've hit most of these personally.

The CUDA detection failure. Ollama relies on detecting your GPU via CUDA. If you have multiple display adapters (super common on laptops with Intel integrated graphics plus an NVIDIA discrete GPU), Ollama sometimes picks the wrong one. The fix: open your NVIDIA Control Panel, go to "Manage 3D Settings," and set the preferred graphics processor to your NVIDIA GPU globally. Restart the Ollama service after.
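
To confirm which device Ollama actually landed on after the change, load a model and check ollama ps. On machines with more than one NVIDIA card, CUDA_VISIBLE_DEVICES is the other knob Ollama honors; the device index 0 below is just an example, so check what nvidia-smi -L reports first.

```powershell
# List the CUDA devices as NVIDIA enumerates them.
nvidia-smi -L

# Optional: pin Ollama to a specific NVIDIA device (multi-GPU machines), then restart Ollama.
[Environment]::SetEnvironmentVariable("CUDA_VISIBLE_DEVICES", "0", "User")

# Load a model and confirm where it ended up.
ollama run gemma3:4b "hello" | Out-Null
ollama ps   # the PROCESSOR column should read GPU, not CPU
```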

Windows Defender real-time scanning. This one is insidious. When Ollama downloads a multi-gigabyte model file, Windows Defender scans it in real time. Downloads slow to a crawl. Sometimes they time out entirely. I've watched model pulls that should take 5 minutes stretch to 30 for no apparent reason. Add Ollama's model directory (typically C:\Users\<you>\.ollama\models) to Defender's exclusion list. Problem gone.
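
You can add the exclusion from an elevated PowerShell prompt instead of clicking through Windows Security. The path below assumes the default model location; adjust it if you've pointed OLLAMA_MODELS somewhere else.

```powershell
# Run as Administrator: exclude the Ollama model store from real-time scanning.
Add-MpPreference -ExclusionPath "$env:USERPROFILE\.ollama\models"

# Verify the exclusion is in place.
(Get-MpPreference).ExclusionPath
```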

The "not enough memory" error that lies. Sometimes Ollama reports insufficient memory even when nvidia-smi shows plenty of free VRAM. This usually means another process is holding a CUDA context — Chrome with hardware acceleration is the usual culprit, but Discord and even your desktop compositor can do it too. Close GPU-hungry apps before loading large models. I keep a batch script that kills the usual suspects before starting an inference session.

WSL2 conflicts. If you're running WSL2 with GPU passthrough enabled, it competes with native Windows Ollama for VRAM. Pick one or the other. Running Ollama inside WSL2 works fine, but don't try both simultaneously.

The single biggest time-saver for local LLM work on Windows: keep your NVIDIA drivers current and close Chrome before loading models. It sounds absurdly simple. It resolves about 80% of the support questions I see.

Performance Tweaks That Actually Matter

Once Gemma 3 is running, there's a gap between "it works" and "it works well." Here's what's worth your time and what isn't.

Choose the right quantization level. Q4_K_M is the sweet spot for most people. It's the default Ollama quantization and offers the best tradeoff between quality and speed. Q8 is noticeably better for complex reasoning tasks but needs roughly double the VRAM. I ran both side by side for a week on coding tasks. The Q4 version handled 90% of what I threw at it without meaningful quality loss. The other 10% was edge-case reasoning where I'd probably have reached for Claude anyway.

Context length matters more than you'd expect. Gemma 3 supports context windows up to 128K tokens on the larger models, but longer contexts eat more VRAM for the KV cache. If you're running near your VRAM ceiling, reduce the context window. You can set this in Ollama by creating a custom Modelfile with a num_ctx parameter. For most interactive use, 4096-8192 tokens is plenty.
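
Creating a reduced-context variant takes about a minute. A sketch, with the 12B model and an 8192-token window as example values:

```powershell
# Build a Gemma 3 variant with a smaller context window to shrink the KV cache.
@"
FROM gemma3:12b
PARAMETER num_ctx 8192
"@ | Set-Content -Path .\Modelfile

ollama create gemma3-12b-8k -f .\Modelfile
ollama run gemma3-12b-8k
```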

Monitor with nvidia-smi. Run it in a separate terminal to watch VRAM usage and GPU utilization in real time. If GPU utilization is low but VRAM is full, you're memory-bound. If VRAM stays half-empty and utilization is also low, the model probably isn't on the GPU at all — something's broken in your CUDA setup.
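
nvidia-smi can refresh itself, so there's no need for a wrapper loop. Leave one of these running in a second terminal while you chat with the model:

```powershell
# Full stats, refreshed every 2 seconds (Ctrl+C to stop).
nvidia-smi -l 2

# Or just the numbers that matter: VRAM used/total and GPU utilization.
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 2
```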

For reference, the NVIDIA Developer Blog reported that the earlier Gemma 7B model hit 93 tokens per second at FP16 on an RTX 4090 with TensorRT-LLM optimizations. Gemma 3's improved architecture gets you in the same ballpark at equivalent parameter counts, and the 4B model is faster still. On my RTX 3080 with Q4 quantization, I consistently see 40-50 tokens per second on the 4B model. Fast enough that responses feel instantaneous.
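
To measure your own numbers, ollama run takes a --verbose flag that prints token counts and eval rates (tokens per second) after each response:

```powershell
# Prints prompt eval rate and eval rate (tokens/s) once the response finishes.
ollama run gemma3:4b --verbose "Summarize what a KV cache does in one paragraph."
```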

If you're curious how local inference stacks up against cloud APIs, I ran those numbers in my local LLM vs Claude coding benchmark. The short version: local wins on latency and privacy, cloud wins on raw capability. Gemma 3 narrows that capability gap more than anything before it.

Gemma 3 vs the Competition: Where It Fits

Gemma 3 isn't the only open model worth running locally. Meta's Llama 3, Alibaba's Qwen 3, and Mistral's models all compete here. Here's my honest take after spending real time with all of them.

Gemma 3 27B punches above its weight class. On reasoning and instruction-following benchmarks, it competes with models twice its size. The 4B model is particularly impressive — it's become my daily driver for quick coding questions and text processing tasks. I reach for it probably 15-20 times a day.

Where Gemma 3 wins: instruction following, multilingual support, and the multimodal capabilities on the 12B and 27B variants. Passing an image along with a text prompt and getting coherent analysis back is a real differentiator for local models.

Where it falls short: raw creative writing (Llama 3 is better here) and very long context coherence (Qwen 3 handles this more gracefully at equivalent sizes). For coding specifically, the 27B model is competitive, but I'd still reach for a specialized code model if that's your primary use case.

The real advantage is Google's ecosystem. Gemma 3 works seamlessly with Ollama, has first-class support in LM Studio and llama.cpp, and Google actively maintains the weights and documentation. That matters when you're building something you need to depend on six months from now.

Here's a video walkthrough that covers the setup process visually if that's more your speed:

Video: "Google Gemma 4 Tutorial - Run AI Locally for Free" (https://www.youtube.com/watch?v=7LEvSOiTWZk)

What Comes Next

Google's docs already list a Gemma 4 model card in their navigation. That signals the next generation is either imminent or already rolling out as you read this. If the jump from Gemma 2 to 3 is any indication, expect better efficiency at every parameter count. The models that demand a 4090 today might run on a 4070 tomorrow.

The trajectory is clear: every generation of open models gets more capable at lower VRAM requirements. I think we're 12-18 months away from a genuinely GPT-4-class model running on an 8GB consumer GPU. That's not hype. It's where the quantization research and architecture improvements are pointing.

If you haven't set up a local inference stack yet, Gemma 3 on Ollama is the lowest-friction starting point. The 4B model runs on hardware most developers already own. And having a capable LLM respond in milliseconds with zero API costs changes how you work in ways you don't anticipate until you've lived with it for a week.

Stop paying per token for tasks a local model handles just as well. Your GPU is sitting there idle. Put it to work.


Originally published on kunalganglani.com
