Jovan Chan

Posted on • Originally published at runaihome.com

Google Gemma 4 for Local AI: Which Size Fits Your GPU? (2026 Guide)

Gemma 4 launched April 2, 2026, with four variants under Apache 2.0—Google's first Gemma release without the custom license that enterprise legal teams had been flagging as a blocker. The headline number circulating through local AI communities within hours of launch: the 26B MoE generates approximately 149 tokens per second on an RTX 4090.

That's not a typo. The 26B "A4B" model uses Mixture-of-Experts routing that activates only about 4 billion of its 26 billion parameters per token, so you get near-26B reasoning quality at something close to 4B-class compute. What that means for GPU selection is non-obvious, and that's what this piece works through.

The four variants decoded

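Google structured Gemma 4 as two tiers of two models each: a small, edge-oriented pair (E2B and E4B) that shares one architecture, and a large pair (26B A4B and 31B) that differs in architecture. The naming confused people initially, so here's what each label actually means: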

| Model | Architecture | Active params/token | Context window | Multimodal |
|---|---|---|---|---|
| E2B | Dense + PLE | ~2.3B | 128K | Vision + Audio |
| E4B | Dense + PLE | ~4.5B | 128K | Vision + Audio |
| 26B A4B | Mixture-of-Experts | ~4B (of 26B total) | 256K | Vision only |
| 31B Dense | Dense | 31B | 256K | Vision only |

"E" stands for Efficient, not Enterprise. The E2B and E4B use Per-Layer Embeddings (PLE)—a technique that packs more parameter capacity into less active computation than a standard dense architecture. They're edge-optimized and designed for tight memory budgets. Both also support native audio input (automatic speech recognition and speech-to-translated-text), a capability neither the 26B nor 31B has.

"26B A4B" means 26 billion total parameters with approximately 4 billion active per forward pass via MoE routing. The entire 26B weight file must still load into VRAM—that's the catch discussed in detail below—but per-token compute tracks with a 4B model, which is why the speed numbers look anomalous.

The two large models (26B and 31B) both have 256K context windows. The E-series models cap at 128K.

VRAM requirements at each quantization level

The Q4_K_M GGUF for the 26B MoE is approximately 14 GB on disk. Runtime VRAM consumption is higher: KV cache, activations, and runtime buffers add 1–3 GB at short context, pushing total usage to 15–17 GB. At 32K+ context the KV cache adds more, reaching 18–22 GB total; at the model's maximum 256K context it's impractical on consumer VRAM without extensive offloading.
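The budget arithmetic above can be sketched in a few lines. This is a rough illustration only: the function names are hypothetical, the overhead is treated as a single lump sum, and the 14 GB and 1–3 GB figures are the approximate values quoted in the paragraph rather than measurements.

```python
def runtime_vram_gb(weights_gb: float, overhead_gb: float) -> float:
    """Estimated total VRAM: weight file plus KV cache, activations, and buffers."""
    return weights_gb + overhead_gb


def fits(card_vram_gb: float, weights_gb: float, overhead_gb: float) -> bool:
    """True if the estimated runtime footprint stays within the card's VRAM."""
    return runtime_vram_gb(weights_gb, overhead_gb) <= card_vram_gb


# Approximate Q4_K_M GGUF size for the 26B MoE, as quoted in the text.
WEIGHTS_GB = 14.0

# Short-context overhead range quoted in the text (1-3 GB).
for overhead in (1.0, 3.0):
    total = runtime_vram_gb(WEIGHTS_GB, overhead)
    print(
        f"overhead {overhead:.0f} GB -> total {total:.0f} GB | "
        f"16 GB card: {fits(16, WEIGHTS_GB, overhead)} | "
        f"24 GB card: {fits(24, WEIGHTS_GB, overhead)}"
    )
```

The point of the sketch is that a 16 GB card sits inside the overhead range: it fits at the low end and does not at the high end, while a 24 GB card fits either way.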

| Model | Q4_K_M VRAM | Q8_0 VRAM | FP16 VRAM |
|---|---|---|---|
| E2B | ~3 GB | ~5 GB | ~10 GB |
| E4B | ~5 GB | ~8 GB | ~18 GB |
| 26B A4B MoE | ~15–17 GB | ~28 GB | ~55 GB |
| 31B Dense | ~18–20 GB | ~32 GB | ~62 GB |

Q4_K_M is the default Ollama quantization for both large models and the practical floor for usable quality on creative and coding tasks. Dropping the 26B MoE to Q3_K_M reduces the GGUF to ~12 GB (workable on 12–16 GB cards) but introduces measurable degradation on structured reasoning. Q8 for the large models requires 28–32 GB of VRAM or splits uncomfortably between VRAM and system RAM.
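One way to read the VRAM table is as a lookup: given a card, which is the highest-precision quantization that fits entirely on-GPU? The sketch below encodes the table's approximate figures (taking the upper end of each range) and ignores long-context KV cache growth. The helper name `best_quant` is hypothetical.

```python
from typing import Optional

# Approximate short-context VRAM needs in GB, taken from the table above
# (upper end of each range). These are the article's estimates, not measurements.
VRAM_TABLE_GB = {
    "E2B": {"Q4_K_M": 3, "Q8_0": 5, "FP16": 10},
    "E4B": {"Q4_K_M": 5, "Q8_0": 8, "FP16": 18},
    "26B A4B MoE": {"Q4_K_M": 17, "Q8_0": 28, "FP16": 55},
    "31B Dense": {"Q4_K_M": 20, "Q8_0": 32, "FP16": 62},
}

# Highest precision first, so the first match is the best one that fits.
QUALITY_ORDER = ["FP16", "Q8_0", "Q4_K_M"]


def best_quant(model: str, card_vram_gb: float) -> Optional[str]:
    """Return the highest-precision quantization that fits, or None if none does."""
    for quant in QUALITY_ORDER:
        if VRAM_TABLE_GB[model][quant] <= card_vram_gb:
            return quant
    return None


for card in (16, 24):
    for model in VRAM_TABLE_GB:
        print(f"{card} GB card, {model}: {best_quant(model, card)}")
```

Because the table's upper bound for the 26B MoE at Q4_K_M is 17 GB, this conservative lookup returns None for a 16 GB card, which matches the article's view that 16 GB is marginal rather than comfortable.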

Inference speed: the MoE advantage, and its ceiling

The RTX 4090's memory bandwidth is 1,008 GB/s. For the 26B MoE at Q4_K_M, the per-token active weight window is approximately 4B × 0.5 bytes = ~2 GB—the MoE routing fetches only the active expert fraction of the full model per token. Result: approximately 149 tokens per second in benchmarks on that hardware.

For the 31B Dense, all 31 billion parameters activate per forward pass. At Q4_K_M (~18 GB loaded), the full weight file sweeps through the memory bus per token. Theoretical ceiling ≈ 1,008 GB/s ÷ 18 GB ≈ 56 tok/s; real-world lands at approximately 28–35 tok/s with short context. When 128K+ context forces KV cache overflow to system RAM (DDR5 quad-channel ~50–60 GB/s), that drops to ~7–8 tok/s—a number that appeared in early benchmarks when testers hit the maximum context window.
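The bandwidth reasoning in the last two paragraphs reduces to one division: memory bandwidth over bytes read per token. A minimal sketch follows, assuming decode is purely bandwidth-bound and using the article's approximations (0.5 bytes per parameter at Q4, an ~18 GB dense weight file). The function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if every token must stream this many GB of weights."""
    return bandwidth_gb_s / gb_read_per_token


RTX_4090_BANDWIDTH_GB_S = 1008.0

# 31B Dense at Q4_K_M: the whole ~18 GB weight file is read for each token.
dense_ceiling = decode_ceiling_tok_s(RTX_4090_BANDWIDTH_GB_S, 18.0)

# 26B MoE at Q4_K_M: only ~4B active parameters at ~0.5 bytes each, about 2 GB.
moe_active_gb = 4e9 * 0.5 / 1e9
moe_ceiling = decode_ceiling_tok_s(RTX_4090_BANDWIDTH_GB_S, moe_active_gb)

print(f"31B Dense ceiling: {dense_ceiling:.0f} tok/s")  # 56 tok/s
print(f"26B MoE ceiling:   {moe_ceiling:.0f} tok/s")    # 504 tok/s
```

Both figures are ceilings, not predictions. The reported ~149 tok/s for the MoE and 28–35 tok/s for the dense model sit well below them, presumably because of costs this simple model leaves out, such as KV cache reads, attention compute, and expert routing.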

| GPU | VRAM | 26B MoE Q4 tok/s | 31B Dense Q4 tok/s (short context) |
|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | 40–50 (context limited) | |
| RTX 5070 Ti 16GB | 16 GB | ~70 (context limited) | |
| RTX 3090 24GB | 24 GB | 64–119 | ~26–30 |
| RTX 4090 24GB | 24 GB | ~149 | ~28–35 |

The "context limited" entries are not a small-print caveat—they're the reason the 16 GB decision is genuinely complicated. The 31B entries are blank for 16 GB cards because the model simply doesn't fit at Q4; partial RAM offloading drops generation to CPU bandwidth speeds (~5–10 tok/s), not practical for interactive use.

The 16 GB trap: what actually happens on an RTX 5060 Ti or 5070 Ti

If you own an RTX 5060 Ti 16GB, RTX 5070 Ti 16GB, or RTX 4060 Ti 16GB, ollama pull gemma4:26b will succeed and inference will start. The problem emerges as conversations lengthen.

gpuforllm.com measures the 26B A4B's actual VRAM demand at approximately 17 GB—1 GB over the physical limit. Ollama compensates by adjusting batch sizes and offloading KV cache overflow to system RAM once context accumulates. For very short sessions (under ~1,500 tokens of total context), you may never notice. For document analysis, extended research sessions, or code reviews that span multiple files, generation speed will drop toward single digits as the session grows.

Practical options for 16 GB owners:

  • Create a custom Modelfile with PARAMETER num_ctx 2048 to cap context and keep everything on-GPU at 40–70 tok/s
  • Use Q3_K_M (~12 GB) if a pre-quantized version is available, with some quality trade-off
  • Fall back to E4B Q8 (~8 GB) for full 128K context without VRAM pressure
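For the first option, the Modelfile only needs two lines. The sketch below generates that text. The `FROM` tag and the `num_ctx` value come from this article, while the helper name and the derived model name in the closing comment are hypothetical.

```python
def build_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Ollama Modelfile text that caps the context window of a base model."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


modelfile_text = build_modelfile("gemma4:26b", 2048)
print(modelfile_text)

# Save the printed text as a file named "Modelfile", then register it with:
#   ollama create gemma4-26b-2k -f Modelfile
# ("gemma4-26b-2k" is a made-up name; pick any name you like.)
```

A larger `num_ctx` trades speed for headroom: the closer the cap gets to the point where the KV cache spills into system RAM, the more of the slowdown described above returns.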

The 26B MoE with genuine 256K context headroom requires 24 GB. For hardware comparisons at the 16 GB tier—including bandwidth and real inference speed differences between the RTX 5060 Ti and 5070 Ti—see our RTX 5070 Ti vs RTX 5080 breakdown.

If you're considering 24 GB, the used RTX 3090 currently runs $895–$1,200 on eBay (May 2026 completed listings). For the full 3-year cost math on that decision, including electricity and residual value, see our RTX 3090 value analysis.

Quality benchmarks

Gemma 4's 31B Dense tops the open-weight under-70B category on math, reasoning, and code as of April 2026:

| Model | MMLU | AIME 2026 | LiveCodeBench v6 |
|---|---|---|---|
| Gemma 4 31B Dense | 85.2% | 89.2% | 80.0% |
| Gemma 4 26B MoE | 82.6% | | |
| Llama 3.3 70B | 86.0% | | |

The gap between 26B MoE and 31B Dense is approximately 2.6 percentage points on MMLU. That gap matters at the extreme edge of competitive math and multi-hop legal reasoning. For typical chat, coding assistance, and summarization it's invisible—and the 26B MoE's 4–5× speed advantage on identical hardware is not.
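Both numbers in that trade-off can be rechecked from figures already quoted in this article. This is a small sketch of the arithmetic, using the MMLU scores from the benchmark table and the RTX 4090 throughput figures from the speed section.

```python
# MMLU scores quoted in the benchmark table (percent).
MMLU_31B_DENSE = 85.2
MMLU_26B_MOE = 82.6

# RTX 4090 Q4 throughput quoted in the speed section (tokens per second).
MOE_TOK_S = 149.0
DENSE_TOK_S_LOW, DENSE_TOK_S_HIGH = 28.0, 35.0

# Quality gap in percentage points, rounded to hide floating-point noise.
gap_points = round(MMLU_31B_DENSE - MMLU_26B_MOE, 1)

# Speed ratio range: the smallest ratio uses the fastest dense figure.
speedup_min = MOE_TOK_S / DENSE_TOK_S_HIGH
speedup_max = MOE_TOK_S / DENSE_TOK_S_LOW

print(f"MMLU gap: {gap_points} points")
print(f"MoE speedup: {speedup_min:.1f}x to {speedup_max:.1f}x")
```

The ratio works out to a little over 4× at the low end and a little over 5× at the high end, consistent with the 4–5× figure.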

The E4B at 42.5% on AIME 2026 is worth noting for a sub-5B model. Previous-generation Gemma 3 4B couldn't approach that on hard math. The PLE architecture genuinely extracts more reasoning per active parameter than standard dense models at that size class.

For coding specifically: LiveCodeBench v6 tracks real competitive programming problems rather than synthetic code completion. The 31B at 80.0% is competitive with larger closed-weight models. If you're evaluating local AI for development workflows, aicoderscope.com covers the full AI coding tool landscape including local model comparisons.

The Apache 2.0 detail that matters

Gemma 1, 2, and 3 all launched under a custom Google li
