Google Ships Gemma 4 QAT Checkpoints: Quantization-Aware Training

#ai #llm #machinelearning #tutorial

What: Google shipped quantization-aware-trained (QAT) checkpoints for the Gemma 4 family — open weights that were trained to survive being squeezed down to 4-bit (and 2-bit on the decode layers).

Why: Low-bit weights are how a real model fits on a phone: Google reports the compact E2B size lands at about a 1 GB memory footprint, small enough to run on consumer hardware instead of a datacenter GPU.

vs prior: Versus post-training quantization (PTQ) — which rounds the weights to the low-bit grid after training and falls off an accuracy cliff at very low bit-widths — QAT simulates that rounding during training, so the weights learn to sit on the grid in the first place.

Think of it as

a singer rehearsing on a cheap keyboard with only a few keys

                   THE NOTE TO HIT
                          │
              ┌───────────┴───────────┐
              │                       │
      ┌───────▼───────┐       ┌───────▼───────┐
      │  PTQ          │       │  QAT          │
      │ (round after) │       │ (train on it) │
      └───────┬───────┘       └───────┬───────┘
              │                       │
     sing free, then          rehearse on the
     auto-tune onto           few keys all along
     the nearest key          so notes land there
              │                       │
              ▼                       ▼
       ✗ far note snaps        ✓ note already sits
         hard — sour             on a key — clean
        (accuracy cliff)        (no cliff)

model weight = a note the singer wants to hit
4-bit grid = the few keys the cheap keyboard actually has
post-training quantization = auto-tuning a freely-sung take onto the nearest key afterward
quantization-aware training = rehearsing on those keys all along, so every note already lands on one
accuracy cliff = how sour it sounds when auto-tune drags a far-off note onto a key

Quick glossary

Quantization-Aware Training (QAT) — Training (or fine-tuning) the model while simulating the low-bit rounding on every forward pass, so the weights learn to land on the quantization grid. The result is a checkpoint that holds up far better at low bit-width than the same model quantized after the fact.

Post-Training Quantization (PTQ) — The cheap default: take a finished full-precision model and round its weights to the low-bit grid afterward, with no retraining. Fast, but the rounding error it introduces is exactly what QAT is built to avoid. GPTQ and AWQ are PTQ methods.

Bit-width / precision — BF16 ("Brain float", 16 bits, from Google Brain) stores a weight in 2 bytes with a near-continuous range; INT4 stores it in 4 bits, i.e. only 16 possible values. Fewer bits = fewer grid points = bigger rounding error.

Q4_0 / GGUF — Q4_0 is a 4-bit weight format; GGUF is the on-disk file format llama.cpp loads (e.g. gemma-4-E2B-it-qat-q4_0.gguf). Gemma 4 also ships "compressed tensors" for serving in vLLM.

Straight-through estimator (STE) — Rounding has a zero gradient almost everywhere, so you can't normally backprop through it. STE is the trick QAT uses: round on the forward pass, but pass the gradient through as if rounding were the identity — letting training "feel" the grid.

Mixed precision by layer — Not every layer is equally fragile. Gemma 4's mobile format keeps the reasoning-critical layers at higher precision and pushes the bulky token-generation (decode) layers down to 2-bit, where the memory savings are largest.

The news. On June 5, 2026, Google released quantization-aware-trained checkpoints for the Gemma 4 family, spanning the compact E2B and E4B edge models up through 12B and larger sizes. Alongside the standard Q4_0 4-bit format, a new mobile schema applies targeted 2-bit quantization to the token-generation layers while keeping the core reasoning layers at higher precision, plus an optimized KV cache and static activations. With the mobile format, Gemma 4 E2B's reported footprint drops to about 1 GB. Checkpoints ship as GGUF for llama.cpp and as compressed tensors for vLLM. Read the announcement →

Picture the singer at a cheap keyboard that has only a handful of keys. The pitches she actually wants to sing live between those keys. The lazy way to record is to sing freely and then auto-tune the take onto the nearest key afterward — and if a note was sitting halfway between two keys, the snap yanks it a long way and the whole phrase sounds sour. That sour snap is post-training quantization: the model finishes training wherever its weights landed, and only then do you round them to the coarse low-bit grid. Quantization-aware training is the disciplined alternative — the singer rehearses on those exact keys the entire time, so every note she learns already lands on one. When you finally record in low fidelity, nothing has to move.

Underneath the metaphor, the "keys" are the grid of values a low-bit format can store, and the "sour snap" is rounding error. Drop a weight from 16-bit down to 4-bit and you go from a near-continuous range to just 16 representable values — so the rounding step has to shove each weight onto the nearest of those few points. PTQ does this once, at the end, to weights that never anticipated it. QAT instead simulates the rounding on every training step (using a straight-through estimator so gradients still flow), so the network learns weights that already sit on the grid — and learns to compensate elsewhere for the little that can't. That rounding gap, which widens as the bit-width drops, is exactly what PTQ pays and QAT trains away.

The reason anyone bothers is memory. Take a 2-billion-parameter model (illustrative — E2B is Google's compact "effective-2B" size). At BF16 (2 bytes per weight) the weights alone need 2,000,000,000 × 2 = 4 GB. Round them to 4-bit (about half a byte each) and that's roughly 1 GB — a ~4× shrink. Gemma 4's mobile format then pushes the bulky decode layers down to 2-bit while protecting the reasoning-critical layers, and — with the KV cache and activations optimized on top — Google reports that format brings E2B's footprint to about 1 GB, small enough to run on phone-class hardware. The mixed-precision-by-layer idea is simple: spend your bits where the model is fragile, save them where it is robust.

Approach	When rounding is applied	4-bit accuracy	Cost to produce
Post-training quantization (PTQ)	after training, once	falls off a cliff at very low bit-width (setup-dependent)	cheap — no retraining
Quantization-aware training (QAT)	simulated on every training step	higher quality than standard PTQ at 4-bit (Google)	needs a training / fine-tune pass

The catch is that QAT is not free: someone has to run that extra training pass, which is why it ships from a lab with the GPUs rather than as a one-line conversion you run at home. That is exactly why a vendor releasing pre-quantized QAT checkpoints matters — Google eats the training cost once, and everyone downloading the GGUF gets 4-bit weights without the accuracy cliff they'd hit by quantizing the model themselves. PTQ still has its place when you can't retrain, and aggressive low-bit work brings its own headaches like outlier weights — but for a model meant to live on a phone, training on the grid is what makes the small size honest.

Goes deeper in: LLM Internals → Quantization → The Quantization Process

Related explainers

LongLive 2.0 — NVFP4 W4A4 training and inference — the same "train at low precision, not just serve at it" idea, taken all the way to 4-bit activations
QCA — outlier injection for PTQ — the failure mode on the other side: what makes post-training quantization break, and how PTQ fights back without retraining
MobileMoE — DRAM-aware scaling — the other half of "fits on a phone": shrinking the active memory of a mixture-of-experts model, not its bit-width

FAQ

What is quantization-aware training (QAT)?

QAT trains or fine-tunes a model while simulating low-bit rounding on every forward pass, so the weights learn to land on the quantization grid. Because the network adapts to the rounding during training, the final checkpoint can be stored at low precision — Gemma 4 ships at 4-bit, with 2-bit decode layers in its mobile format — with much less quality loss than rounding the weights afterward.

How is QAT different from post-training quantization?

Post-training quantization (PTQ) rounds a finished full-precision model down to the low-bit grid once, at the end, with no retraining — cheap, but it introduces rounding error the model never learned to absorb, which becomes an accuracy cliff at very low bit-widths. QAT moves that rounding into training, so the weights already sit on the grid and the model compensates for what little error remains.

How does Gemma 4 fit in about 1 GB on a phone?

Two things stack. First, 4-bit weights are roughly 4× smaller than BF16 (about half a byte per weight instead of two bytes). Second, Gemma 4's mobile format pushes the bulky token-generation layers down to 2-bit while keeping reasoning-critical layers higher, and optimizes the KV cache and activations. Google reports the compact E2B size lands at about a 1 GB footprint with the mobile format, and QAT is what keeps that aggressive squeeze from wrecking quality.

Originally posted on Learn AI Visually.