Originally published on AI Tech Connect.
Three things builders should know The model now fits the device. Google DeepMind released Gemma 4 12B on 3 June 2026, built to run locally on a consumer laptop with 16GB of RAM, and the first mid-sized Gemma with native audio inputs. QAT shrank it to about 1GB. Around 5 June 2026 DeepMind shipped quantisation-aware training checkpoints — a Q4_0 build plus a new mobile-specialised format that cuts the Gemma 4 E2B footprint to roughly 1GB. Arm hardware makes it interactive. Arm reports Gemma 4 E2B on Armv9 CPUs hits about a 5.5x average prefill speedup and up to 1.6x faster decode — fast enough for real on-device use, not a demo. For the past two years the on-device story has been "almost". You could run a small model on a phone, but it was either too dumb to be useful or too slow to feel…
Top comments (0)