Jovan Chan

Posted on • Originally published at runaihome.com

Gemma 4 QAT for Local AI in 2026: How Google's June 5 Checkpoints Put the 26B in 15GB

This article was originally published on runaihome.com

On June 5, 2026, Google released Quantization-Aware Training (QAT) checkpoints for every Gemma 4 size. The practical result: the 26B-A4B model that needed roughly 17 GB of VRAM at standard Q4 — over the limit of a 16 GB card — now runs in about 15 GB with near-original quality. The headline figure Google quotes is a ~72% VRAM cut versus the BF16 baseline.

That changes the GPU recommendation we published on May 26. Back then, the Gemma 4 GPU guide had to warn 16 GB owners that the 26B MoE technically loaded but spilled KV cache to system RAM past ~1,500 tokens. The QAT checkpoints largely close that gap. They also introduce a trap: if you convert the checkpoints to GGUF yourself the wrong way, you lose most of the quality QAT was supposed to preserve.

TL;DR

QAT shrinks every Gemma 4 model by about 72% versus BF16 with almost no quality loss, so the 26B-A4B finally fits a 16 GB card and the 31B fits 24 GB comfortably. The catch: don't hand-convert the checkpoints. Use Unsloth's pre-converted UD-Q4_K_XL GGUFs or Ollama's -it-qat tags instead. vLLM users get compressed-tensors checkpoints for every size except the 26B MoE.

| | 16 GB GPU (e.g. 5060 Ti 16GB) | 24 GB GPU (e.g. used RTX 3090) | Apple Silicon / unified |
|---|---|---|---|
| Best QAT fit | 26B-A4B (~15 GB) or 12B (~7 GB) | 31B (~18 GB) with room for context | 26B/31B via MLX |
| What you gain | 26B now stays on-GPU at full speed | 31B at full 256K context headroom | Largest models on one box |
| The catch | Tight margin; cap context if doing long docs | Was already fine pre-QAT; now has slack | E4B has a known Triton speed bug |

Honest take: If you own a 16 GB card and skipped the 26B MoE because of the VRAM overflow, the QAT checkpoint is the update that makes it actually usable — pull gemma4:26b-a4b-it-qat and stop fighting your KV cache.

What QAT actually does (and why it's not just another Q4)

Standard quantization — Post-Training Quantization (PTQ) — takes a fully trained BF16 model and compresses the weights to 4-bit afterward. The model never "knew" it would be quantized, so rounding error accumulates and quality drops, sometimes by several points on reasoning benchmarks.

Quantization-Aware Training simulates the 4-bit rounding during training. The model learns to place its weights where quantization hurts least, so the final int4 checkpoint lands much closer to the BF16 original. Google applied its QAT recipe to the Q4_0 format for every Gemma 4 size, and the released checkpoints hold near-full-precision quality at a Q4 footprint.
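To make the PTQ-versus-QAT distinction concrete, here is a minimal sketch of the "fake quantization" step that QAT-style training inserts into the forward pass: weights are rounded onto a 4-bit grid and immediately dequantized, so the loss sees the rounding error while the model is still learning. The block size of 32, the symmetric signed grid, and the helper name `fake_quant_4bit` are illustrative assumptions; this is not Google's recipe and not llama.cpp's exact Q4_0 layout.

```python
def fake_quant_4bit(weights, block_size=32):
    """Quantize-dequantize weights block by block on a signed 4-bit grid.

    Simplified and symmetric (levels -8..7, one scale per block). This is an
    illustrative assumption, not the exact Q4_0 or Gemma 4 QAT scheme.
    """
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # One scale per block maps the largest magnitude onto level 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        for w in block:
            level = max(-8, min(7, round(w / scale)))
            out.append(level * scale)
    return out


# Synthetic weights, just to have something to round.
weights = [0.013 * ((i * 37) % 19 - 9) for i in range(64)]
dequantized = fake_quant_4bit(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max rounding error: {max_error:.5f}")
```

With PTQ this rounding is applied once, after training has finished. With QAT the model trains against it, typically with a straight-through gradient estimator so the non-differentiable `round` does not block learning.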

This is the same playbook Google ran for Gemma 3 in 2025, where QAT int4 dropped the 27B from 54 GB (BF16) to 14.1 GB, the 12B from 24 GB to 6.6 GB, and the 4B from 8 GB to 2.6 GB — all while staying within a few Elo points of the BF16 versions. Gemma 4 QAT extends that to the full current lineup, including the new 12B and the 26B-A4B MoE. If the general idea of quantization levels is new to you, our quantization explainer and the Q4 vs Q5 vs Q6 vs Q8 quality breakdown cover the fundamentals.
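The Gemma 3 figures quoted above can be turned into percentage savings with a few lines of arithmetic. This sketch uses only the sizes from the paragraph; the dictionary and function names are made up for illustration.

```python
# (BF16 GB, QAT int4 GB) as quoted above for Gemma 3.
GEMMA3_SIZES_GB = {
    "27B": (54.0, 14.1),
    "12B": (24.0, 6.6),
    "4B": (8.0, 2.6),
}


def reduction_pct(bf16_gb, qat_gb):
    """Percentage of memory saved relative to the BF16 footprint."""
    return (1 - qat_gb / bf16_gb) * 100


reductions = {name: reduction_pct(*sizes) for name, sizes in GEMMA3_SIZES_GB.items()}
for name, pct in reductions.items():
    print(f"Gemma 3 {name}: {pct:.1f}% smaller than BF16")
```

The three results come out at about 73.9%, 72.5%, and 67.5%, which is the same ballpark as the ~72% figure quoted for Gemma 4.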

The new memory map

Here's what each Gemma 4 QAT variant needs to run, per Google's release notes and Unsloth's GGUF sizing:

| Model | Active params/token | QAT memory to run | Pre-QAT Q4 (for comparison) |
|---|---|---|---|
| E2B | ~2.3B | ~3 GB | ~3 GB |
| E4B | ~4.5B | ~5 GB | ~5 GB |
| 12B Dense | 12B | ~7 GB | ~10–12 GB |
| 26B-A4B MoE | ~4B (of 26B) | ~15 GB | ~15–17 GB |
| 31B Dense | 31B | ~18 GB | ~18–20 GB |

Two things stand out. First, the E2B QAT checkpoint is about 1 GB on disk and runs in roughly 3 GB — small enough that Google is positioning it for phones and laptops with integrated graphics. Second, the 26B-A4B at ~15 GB is the entry that matters most for desktop home labs: it crosses back under the 16 GB line.
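A rough sanity check on these figures: llama.cpp's Q4_0 stores weights in blocks of 32 four-bit values plus one 16-bit scale, which works out to 4.5 bits per weight. The sketch below estimates the weights-only footprint from that number. It assumes the nominal parameter counts in the model names (12, 26, and 31 billion) and 1 GB = 1e9 bytes, and it ignores KV cache, runtime buffers, and any tensors a real checkpoint keeps at higher precision.

```python
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32  # 4.5: 32 four-bit weights + one 16-bit scale
BF16_BITS_PER_WEIGHT = 16


def weight_gb(params_billions, bits_per_weight):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def q4_0_saving_pct():
    """Saving of a pure Q4_0 layout relative to BF16, in percent."""
    return (1 - Q4_0_BITS_PER_WEIGHT / BF16_BITS_PER_WEIGHT) * 100


# Nominal total parameter counts, taken from the model names (an assumption).
NOMINAL_PARAMS_B = {"12B": 12, "26B-A4B": 26, "31B": 31}

estimates = {name: weight_gb(p, Q4_0_BITS_PER_WEIGHT) for name, p in NOMINAL_PARAMS_B.items()}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights at Q4_0")
print(f"saving vs BF16: {q4_0_saving_pct():.1f}%")
```

The estimates (about 6.8, 14.6, and 17.4 GB) land close to the ~7, ~15, and ~18 GB in the table, and the 71.9% saving matches the ~72% headline. Note that the MoE is sized by its full 26B parameters, not the ~4B active per token, because every expert has to be resident in memory.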

That's the whole story for RTX 5060 Ti 16GB owners. The May guide measured the standard 26B at ~17 GB of real demand, 1 GB over the card's capacity, which forced KV-cache overflow into system RAM as conversations grew. The QAT checkpoint lands at ~15 GB, leaving a small but workable margin for context on a 16 GB card. For long-document or multi-file code-review sessions you'll still want to watch context length, but the cliff that hit the standard build at ~1,500 tokens is gone for normal chat and coding use.

For 24 GB cards, QAT is pure upside. The 31B Dense at ~18 GB leaves 6 GB for context and runtime buffers on a used RTX 3090 — the 256K context window becomes genuinely usable instead of theoretical. Used 3090 pricing has climbed this year on the GDDR7 shortage; our RTX 3090 value analysis and the 5060 Ti 16GB vs 3090 total-cost piece have the current numbers if you're deciding between the two tiers.
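How much context a given amount of free VRAM buys depends on the KV cache cost per token. The sketch below shows the standard full-attention estimate. The layer count, KV-head count, and head dimension are HYPOTHETICAL placeholders, not Gemma 4's real configuration, and the 0.5 GB runtime reserve is also an assumption. Architectures that use sliding-window layers or a quantized KV cache need considerably less than this upper bound.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(vram_gb, model_gb, per_token_bytes, reserve_gb=0.5):
    """Tokens of KV cache that fit after the model and a runtime reserve."""
    free_bytes = (vram_gb - model_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


# HYPOTHETICAL architecture numbers, for illustration only.
per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)

tokens_16gb = tokens_that_fit(16, 15, per_token)  # 16 GB card, ~15 GB model
tokens_24gb = tokens_that_fit(24, 18, per_token)  # 24 GB card, ~18 GB model
print(f"16 GB card: room for about {tokens_16gb} tokens")
print(f"24 GB card: room for about {tokens_24gb} tokens")
```

Swap in the real values from the model's config file to get a usable number; the point of the sketch is only that the 6 GB of slack on a 24 GB card buys roughly an order of magnitude more context than the 1 GB left on a 16 GB card.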

The conversion trap: don't roll your own GGUF

This is the part that's tripped up early adopters, and it's worth stating plainly: do not convert the Gemma 4 QAT Hugging Face checkpoints to GGUF yourself with a naive llama.cpp pass.

The reason is technical but concrete. The QAT checkpoints ship in BF16 with BF16 scales. llama.cpp's Q4_0 format uses F16 scales. Converting QAT BF16 → llama.cpp Q4_0 is not lossless — the scale-format mismatch reintroduces exactly the kind of accuracy drop QAT was trained to avoid. People who did the straightforward conversion reported measurable quality regressions despite producing a larger file than the optimized version.
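To see why BF16 and F16 are not interchangeable, it helps to check which values survive a round trip. BF16 has a wider exponent range but fewer mantissa bits than F16, so a BF16 value converts exactly only when it falls inside F16's normal range. The sketch below emulates both formats with the standard library. The helper names are made up, and it makes no claim about which specific scales in the Gemma 4 checkpoints are affected.

```python
import struct


def to_bf16(x):
    """Round a float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_f16(x):
    """Round a float to the nearest IEEE half-precision value.

    Raises OverflowError for magnitudes beyond the F16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def survives_f16(bf16_value):
    """True if a BF16 value is represented exactly in F16."""
    return to_f16(bf16_value) == bf16_value


for raw in (0.0123, 3e-6, 1e-9):
    scale = to_bf16(raw)
    print(f"bf16 {scale:.6e} -> f16 {to_f16(scale):.6e} exact={survives_f16(scale)}")
```

Mid-range values pass through unchanged, very small ones lose bits in F16's subnormal range, and anything below roughly 3e-8 flushes to zero.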

The fix is to use Unsloth's pre-converted dynamic GGUFs. Unsloth ships a single recommended quant per model — UD-Q4_K_XL — that is both smaller and more accurate than a hand-rolled Q4_0 of the same checkpoint. The 26B-A4B UD-Q4_K_XL file is about 17 GB on disk and runs in ~15 GB. If you pull through Ollama, you skip the question entirely:

```bash
# QAT variants published in the Ollama library
ollama pull gemma4:e2b-it-qat
ollama pull gemma4:e4b-it-qat
ollama pull gemma4:12b-it-qat
ollama pull gemma4:26b-a4b-it-qat
ollama pull gemma4:31b-it-qat
```

Ollama handles the Unsloth-style quantization automatically for the -it-qat tags, so you don't manage GGUF conversion at all. You do need an Ollama build with native Gemma 4 (gemma4) support — if ollama pull errors on an unknown model, update Ollama first. If you hit VRAM errors regardless of QAT, our CUDA out of memory fix guide covers the num_ctx and KV-cache settings that reclaim the most headroom.
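Once a QAT tag is pulled, capping the context window per request is one way to protect the thin margin on a 16 GB card. This is a minimal sketch against Ollama's local REST endpoint (`/api/generate` on port 11434) using the `num_ctx` option. The 8192 default and the helper names are illustrative assumptions, so tune the value to your own workload.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt, model="gemma4:26b-a4b-it-qat", num_ctx=8192):
    """Request body for /api/generate with a capped context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt, **kwargs):
    """Send the request to a locally running Ollama server and return the text."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Requires a running Ollama server with the model already pulled:
# print(generate("Summarize what QAT changes for a 16 GB GPU.", num_ctx=4096))
```

A smaller `num_ctx` shrinks the KV cache allocation, which is usually the quickest lever when a model loads but runs out of memory mid-conversation.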

vLLM: compressed-tensors, with one gap

If you're serving batched, multi-user inference rather than running a single chat session, vLLM is the better target — and Google released QAT in compressed-tensors format for it. These checkpoints use 4-bit integer weights with 16-bit activations (W4A16, group_size=32), tagged -w4a16-ct in Google's Hugging Face namespace.

The one gap: the 26B-A4B MoE is not in the W4A16 QAT set. Its expert dimension (704) is small enough that 4-bit quantization causes excessive quality loss, so Google shipped compressed-tensors checkpoints for E2B, E4B, 12B, and 31B only. For the 26B MoE on vLLM you fall back to a higher-precision format or run the GGUF path through llama.cpp/Ollama instead. If you're weighing the two serving engines, our vLLM vs Ollama breakdown explains when each one wins.
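The availability rule above is simple enough to encode as a lookup. The sketch below picks a serving route per model size and includes a thin vLLM loader. The route labels and function names are made up for illustration, and `repo_id` is deliberately left to the caller: use the exact `-w4a16-ct` repository name from Google's Hugging Face namespace. The `max_model_len` default is likewise an assumption.

```python
# Sizes with a W4A16 compressed-tensors QAT checkpoint, per the list above.
W4A16_SIZES = {"E2B", "E4B", "12B", "31B"}


def serving_path(size):
    """Pick a serving route for a Gemma 4 QAT size (labels are illustrative)."""
    if size in W4A16_SIZES:
        return "vllm-w4a16"
    if size == "26B-A4B":
        return "higher-precision-or-gguf"
    raise ValueError(f"unknown Gemma 4 size: {size}")


def load_with_vllm(repo_id, max_model_len=8192):
    """Load a compressed-tensors checkpoint in vLLM.

    vLLM normally picks up the quantization method from the checkpoint's
    own config, so no explicit quantization argument is passed here.
    """
    from vllm import LLM  # imported lazily so serving_path works without vLLM

    return LLM(model=repo_id, max_model_len=max_model_len)


for size in ("12B", "26B-A4B", "31B"):
    print(size, "->", serving_path(size))
```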

Speed: QAT doesn't make it faster, it makes it fit

A common misconception is that QAT boosts tokens/sec. It mostly doesn't — decode speed on these models is bound by memory bandwidth, and the active-weight footprint per token is similar to standard Q4. What QAT buys you is fit: keeping the model entirely on-GPU instead of spi
