Thurmon Demich

Posted on Jun 21 • Originally published at bestgpuforllm.com

How Much VRAM for Gemma 4? Every Variant Explained

#vram #gemma4 #google #llm

Cross-posted from Best GPU for LLM — visit the original for our VRAM calculator, GPU comparison table, and current Amazon pricing.

Google released Gemma 4 with four variants spanning from pocket-sized to flagship. The VRAM spread is massive: the smallest model fits on a phone, while the largest demands a high-end desktop GPU. This guide breaks down approximately how much VRAM each variant needs at every common quantization level, so you can match the right model to your hardware.

The short version

Variant	Q4_K_M weights	GPU you need
E2B (~2B)	~1.2GB	Any 4GB+ GPU
E4B (~4B)	~2.5GB	Any 6GB+ GPU
26B-A4B (MoE)	~14GB	16GB GPU (RTX 4060 Ti 16GB, RTX 5070 Ti)
31B Dense	~20GB	24GB+ GPU (RTX 4090, RTX 5090)

If you are deciding which variant to run, the 26B-A4B MoE is the sweet spot: 30B-class quality that fits on a 16GB card at Q4_K_M, as long as you keep context short. Full hardware recommendations are in our best GPU for Gemma 4 buyer's guide.

Detailed VRAM by quantization

Gemma 4 E2B (~2B parameters)

Quantization	Model Size	Total VRAM (with KV cache)
Q3_K_M	~0.9GB	~1.5GB
Q4_K_M	~1.2GB	~1.5GB
Q5_K_M	~1.5GB	~2GB
Q6_K	~1.7GB	~2.5GB
Q8_0	~2.2GB	~3GB
FP16	~4GB	~5GB

The E2B fits everywhere. Integrated GPUs with 4GB shared memory, older GTX cards, even the Intel Arc B580 — all handle it without issue. VRAM is a non-concern here.

Gemma 4 E4B (~4B parameters)

Quantization	Model Size	Total VRAM (with KV cache)
Q3_K_M	~1.8GB	~2.5GB
Q4_K_M	~2.5GB	~3.5GB
Q5_K_M	~3GB	~4GB
Q6_K	~3.5GB	~5GB
Q8_0	~4.5GB	~6GB
FP16	~8GB	~10GB

Any GPU with 6GB+ VRAM handles Q4_K_M comfortably. For FP16 inference (useful for testing), you need 10GB or more. Fine-tuning at FP16 needs additional memory on top of that for gradients and optimizer state.

Gemma 4 26B-A4B (MoE — the important one)

Quantization	Model Size	Total VRAM (4K ctx)	Total VRAM (8K ctx)
Q3_K_M	~11GB	~13GB	~14.5GB
Q4_K_M	~14GB	~16GB	~18GB
Q5_K_M	~17GB	~19GB	~21GB
Q6_K	~20GB	~22GB	~24GB
Q8_0	~26GB	~28GB	~30GB

This is where VRAM planning matters. The 26B MoE has 26 billion total parameters that all live in VRAM, even though only ~4B activate per token. At Q4_K_M, the model weights alone are ~14GB. Add KV cache for a typical conversation and you are at 16-18GB.

On a 16GB card (RTX 4060 Ti 16GB, RTX 5070 Ti, RTX 5080): Q4_K_M fits, but only just. The ~16GB total at 4K context fills the whole card, so keep context under 4K tokens for stability. Q3_K_M gives more breathing room at a small quality cost.

On a 24GB card (RTX 4090, RTX 3090): Q4_K_M runs with plenty of headroom. You can push to Q5_K_M and maintain 8K+ context comfortably.
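
The point about MoE memory can be checked with simple arithmetic. The sketch below is a rough estimate, not a measurement: the helper name weights_gb is ours, and the bits-per-weight value is back-derived from the ~14GB Q4_K_M figure in the table above rather than taken from any official spec.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (billions of params x bytes per param)."""
    return params_billion * bits_per_weight / 8


# Assumption: effective bits per weight implied by this article's
# ~14GB Q4_K_M figure for 26B total parameters (back-derived, not official).
implied_bits = 14 * 8 / 26

resident = weights_gb(26, implied_bits)  # every expert must sit in VRAM
active = weights_gb(4, implied_bits)     # roughly what one token touches

print(f"implied bits per weight: {implied_bits:.2f}")
print(f"resident weights: {resident:.1f} GB")
print(f"active per token: {active:.1f} GB")

# Sanity check against the E4B table: ~4B params at FP16 (16 bits) is ~8GB.
print(f"E4B at FP16: {weights_gb(4, 16):.1f} GB")
```

The active share is what makes the MoE fast, but the resident figure is what decides whether it fits on your card.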

Gemma 4 31B Dense

Quantization	Model Size	Total VRAM (4K ctx)	Total VRAM (8K ctx)
Q3_K_M	~16GB	~18.5GB	~20GB
Q4_K_M	~20GB	~22GB	~24GB
Q5_K_M	~24GB	~26GB	~28GB
Q6_K	~28GB	~30GB	~32GB
Q8_0	~35GB	~37GB	~39GB

The 31B Dense is straightforward but demanding. At Q4_K_M, you need at least 22GB with any meaningful context. The RTX 4090 (24GB) barely fits it — long conversations or large context windows may cause out-of-memory errors. The RTX 5090 (32GB) is the comfortable choice, fitting Q4 and even Q5 with room to spare.
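
If you want to check a specific card against the two large-model tables, a small lookup does the job. This is a minimal sketch: the numbers are copied from the tables above (they are estimates), and the fits helper and its reserve_gb parameter are our own illustration, where reserve_gb is an assumed safety margin for the desktop, driver, and runtime buffers.

```python
# Total VRAM in GB (weights + KV cache), copied from the tables above.
# Layout: model -> quantization -> context length in tokens.
VRAM_GB = {
    "26B-A4B": {
        "Q3_K_M": {4096: 13.0, 8192: 14.5},
        "Q4_K_M": {4096: 16.0, 8192: 18.0},
        "Q5_K_M": {4096: 19.0, 8192: 21.0},
        "Q6_K": {4096: 22.0, 8192: 24.0},
        "Q8_0": {4096: 28.0, 8192: 30.0},
    },
    "31B": {
        "Q3_K_M": {4096: 18.5, 8192: 20.0},
        "Q4_K_M": {4096: 22.0, 8192: 24.0},
        "Q5_K_M": {4096: 26.0, 8192: 28.0},
        "Q6_K": {4096: 30.0, 8192: 32.0},
        "Q8_0": {4096: 37.0, 8192: 39.0},
    },
}


def fits(model: str, card_gb: float, ctx: int, reserve_gb: float = 0.0) -> list[str]:
    """List quantizations whose estimated total plus reserve_gb fits the card."""
    return [
        quant
        for quant, by_ctx in VRAM_GB[model].items()
        if by_ctx[ctx] + reserve_gb <= card_gb
    ]


print(fits("26B-A4B", 16, 4096))
print(fits("31B", 24, 8192))
print(fits("31B", 24, 8192, reserve_gb=1.0))
print(fits("31B", 32, 8192, reserve_gb=1.0))
```

With no reserve, the 31B at Q4_K_M lands exactly on the 24GB limit at 8K context. Ask for even 1GB of margin and it drops out, which is why the RTX 4090 only barely fits it.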

VRAM chart available at the original article

KV cache: the hidden VRAM eater

Every table above includes estimated KV cache overhead, but actual usage depends on your conversation length. A rough guide for the 26B and 31B variants:

2K context: +1-2GB over model weights
4K context: +2-3GB
8K context: +3-5GB
16K context: +5-8GB
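
To see why the overhead grows with context, here is the textbook KV cache formula for standard attention as a sketch. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Gemma 4's real configuration, so treat the output as an illustration of the scaling rather than a prediction.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    ctx_tokens: int,
    bytes_per_elem: int = 2,  # 2 bytes = FP16 cache entries
) -> float:
    """KV cache size in GB: one K and one V vector per layer, per KV head, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total_bytes / 1e9


# Hypothetical architecture values, NOT taken from Gemma 4's published config.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (2048, 4096, 8192, 16384):
    print(f"{ctx:>6} tokens: {kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx):.2f} GB")
```

The cache scales linearly with context, so doubling the window doubles this term. The figures in the rough guide above are higher than the raw formula because they also cover runtime buffers. Models that use sliding-window attention on some layers or a quantized KV cache will come in lower.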

For the 26B MoE on a 16GB card, this is the critical constraint. The model weights fit at Q4, but a long back-and-forth conversation can push past 16GB. Our recommendation: use a tool like nvtop or nvidia-smi to monitor VRAM during inference, and reduce context length if you see usage hitting 95%+.
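
If you would rather script that check than watch a terminal, the sketch below polls nvidia-smi through its CSV query mode. It assumes an NVIDIA GPU with nvidia-smi on your PATH. The function names and the 0.95 threshold are our own choices, mirroring the 95% rule of thumb above.

```python
import subprocess


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line (values in MiB) from nvidia-smi."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory() -> list[tuple[int, int]]:
    """Return (used_mib, total_mib) for each GPU. Requires nvidia-smi on PATH."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_memory_line(line) for line in result.stdout.strip().splitlines()]


def near_limit(used_mib: int, total_mib: int, threshold: float = 0.95) -> bool:
    """True when usage is at or above the threshold fraction of total VRAM."""
    return used_mib / total_mib >= threshold


def report() -> None:
    """Print usage per GPU and flag any card close to its limit."""
    for index, (used, total) in enumerate(gpu_memory()):
        flag = " <- reduce context" if near_limit(used, total) else ""
        print(f"GPU {index}: {used}/{total} MiB ({used / total:.0%}){flag}")
```

Call report() while a conversation is running. If a card gets flagged, shorten the context window or drop to a smaller quantization.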

Which quantization should you use?

For Gemma 4 specifically:

Q4_K_M is the standard recommendation. Minimal quality loss, good VRAM efficiency.
Q5_K_M is worth it on the 26B MoE if you have a 24GB card — the quality bump is noticeable on reasoning tasks.
Q3_K_M is acceptable on the 26B MoE for 16GB cards that need context headroom. Quality loss is small.
Q8_0 and above are only practical on the E2B and E4B variants unless you have a 32GB card (the 26B MoE at Q8_0 needs ~28-30GB) or a multi-GPU setup.

For a broader guide to quantization trade-offs across all models, see best quantization for local LLM.

GPU recommendations by variant

Variant	Budget pick	Best pick
E2B / E4B	Any GPU you already own	N/A — any modern GPU works
26B-A4B MoE	RTX 4060 Ti 16GB (~$400)	RTX 5070 Ti (~$750)
31B Dense	RTX 3090 used (~$600)	RTX 4090 (~$1,600)

For full GPU benchmarks and buying recommendations, head to our best GPU for Gemma 4 guide, or our broader best GPU for Gemma overview spanning the 2B/7B/27B classics. Need general VRAM guidance across all model families? See how much VRAM for local LLM. And if you are running models through Ollama, our best GPU for Ollama article covers setup specifics. Budget shoppers should check best budget GPU for local LLM for affordable options.

Continue on Best GPU for LLM for the complete guide with interactive calculators and current GPU prices.
