
pueding

Posted on • Originally published at learnaivisually.com

Google Releases Gemma 4 12B: Encoder-Free Multimodal Projection

What: Google released Gemma 4 12B, an open multimodal model whose headline trick is encoder-free multimodal projection — it turns images and audio into tokens by projecting them straight into the token space, instead of running them through a dedicated encoder network.

Why: The separate vision and audio encoders most multimodal models carry are extra parameters, compute, and latency that run before the language model sees anything; dropping them is a big reason a 12B model can field pictures and sound inside 16 GB of memory.

vs prior: Versus the standard recipe — a frozen vision transformer (ViT) plus a projector bolted onto a text model — Gemma 4 12B has no separate encoder at all: each image patch becomes a token through one matrix multiply directly into the backbone.

Think of it as

A meeting where guests either go through a translator or speak the language.

              IMAGE / AUDIO ARRIVES
                       │
        ┌──────────────┴──────────────┐
        │                             │
 ┌──────▼───────┐             ┌───────▼──────┐
 │  THE OLD WAY │             │ ENCODER-FREE │
 │via translator│             │ speak direct │
 └──────┬───────┘             └───────┬──────┘
        │                             │
  a whole vision/audio          one matrix-multiply
  encoder runs first            projects to a token
        │                             │
        ▼                             ▼
 ✗ extra params + latency      ✓ same token space,
   before the LLM looks          ~16 GB, lower latency
  • text token = a guest who already speaks the room's language
  • vision/audio encoder = a separate translator the old way routes pictures and sound through
  • encoder-free projection = one matrix-multiply that puts vision and audio into the room's language directly
  • shared token space = the single language every guest speaks once inside

Quick glossary

Encoder-free (VLM) — A multimodal model with no separate encoder for non-text inputs — rather than run an image through a vision network first, it projects the raw input straight into the model's token space. The lineage runs through research models like Fuyu and EVE.

Vision encoder / ViT — A Vision Transformer — a stack of attention-and-MLP layers that turns an image into feature vectors. In the usual recipe it sits in front of the language model as a second network; encoder-free designs delete it.

Patch — An image is cut into a grid of small squares (e.g. 16×16 pixels). Each patch is flattened into a list of raw numbers and treated as one unit of input — the visual equivalent of a text token.

Projection — A single matrix multiply that maps a vector of one size onto a vector of another. Here it maps a flattened image patch onto a vector the same width as a word's embedding — so the result is a token; audio is folded into that same space.

Token / embedding space — A transformer doesn't read words or pixels; it reads dense vectors. The "embedding space" is the shared vector format every input must arrive in — putting images and audio there is what lets one backbone read all three.

Native audio — Audio handled inside the model as tokens, rather than transcribed to text by a separate speech model first. Gemma 4 12B is the first mid-sized Gemma to take audio in natively.

The news. On June 3, 2026, Google released Gemma 4 12B, an Apache-2.0 model that drops the separate vision and audio encoders most multimodal models bolt on. Instead it projects both kinds of input straight into the language backbone: vision through a lightweight module (reportedly a single matrix multiply plus positional and normalization terms) and audio into the same dimensional space as text tokens. It is the first mid-sized Gemma to take native audio input, runs on 16 GB of VRAM or unified memory, and reportedly scores near Google's larger 26B mixture-of-experts model.

Picture the meeting. A text prompt is a guest who already speaks the room's language — it walks in and starts talking. A picture and a sound clip don't: the usual fix hires a separate translator for each, a whole second staffer who listens, re-voices everything, and only then lets the guest join. Those translators are the model's vision and audio encoders — extra networks that run before the language model sees a thing. Gemma 4 12B fires the translators. It teaches pictures and sound to speak the room's language directly, in one quick step, so every guest — text, image, audio — sits at the same table as an ordinary token.

Underneath the metaphor, "speaking the room's language" means landing in the model's embedding space — the dense vectors a transformer actually consumes. A token ID becomes a vector by a lookup; an image patch becomes one by a projection. As a toy example, cut a 256×256 image into 16×16 patches and you get 256 patches, each a flat list of 16·16·3 = 768 raw numbers. The old way pushes patches like these through a vision transformer — tens of attention-and-MLP layers — before the LLM gets a single feature. Gemma's encoder-free path instead, by Google's description, applies a single matrix multiply (plus a positional term and normalization) that turns each patch straight into a token, the same shape as a word's embedding. Audio is projected into that same space too. The whole pre-LLM encoder stack collapses to that one projection — and the backbone itself takes over the visual and acoustic processing.
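The toy arithmetic above can be checked with a short sketch. It is illustrative only, not Gemma's actual implementation: the embedding width `D_MODEL = 8`, the random weights, the learned-style positional table, and the choice of RMS normalization are all assumptions made for the example, since the source only says "a positional term and normalization". It uses plain Python lists so it runs without any dependencies.

```python
import math
import random

PATCH = 16     # patch side in pixels, from the toy example
CHANNELS = 3   # RGB
D_MODEL = 8    # hypothetical, tiny embedding width so the sketch runs quickly


def patchify(image, patch=PATCH):
    """Split an H x W x C nested-list image into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append this pixel's channel values
            patches.append(flat)
    return patches


def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (an assumed choice of norm)."""
    scale = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / scale for v in vec]


def project(patches, weight_t, positions):
    """Per patch: one matrix multiply, add a positional vector, normalize.

    weight_t is stored transposed (D_MODEL rows of patch_dim values).
    """
    tokens = []
    for patch_vec, pos in zip(patches, positions):
        raw = [sum(x * w for x, w in zip(patch_vec, row)) for row in weight_t]
        tokens.append(rms_norm([r + p for r, p in zip(raw, pos)]))
    return tokens


rng = random.Random(0)
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(256)]
         for _ in range(256)]

patches = patchify(image)
patch_dim = PATCH * PATCH * CHANNELS  # 16 * 16 * 3 = 768
weight_t = [[rng.gauss(0.0, 0.02) for _ in range(patch_dim)]
            for _ in range(D_MODEL)]
positions = [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in patches]

image_tokens = project(patches, weight_t, positions)
print(len(patches), len(patches[0]))            # 256 768
print(len(image_tokens), len(image_tokens[0]))  # 256 8
```

Each of the 256 patches ends up as one vector of width `D_MODEL`, the same shape a text token's embedding would have in a model of that width.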

| Approach | How an image enters | Separate encoder? | Cost profile |
| --- | --- | --- | --- |
| Encoder-based (ViT + projector) | image → vision transformer (tens of layers) → projector → tokens | yes: a full vision network runs first | more parameters and latency before the first output token |
| Encoder-free (Gemma 4 12B) | patches → one matrix multiply (+ position/norm) → tokens | no separate encoder | ~16 GB, lower pre-decode latency (Google, reported) |

Removing the encoder stack has consequences, but the wins are concrete. A separate vision tower is parameters you store, compute you run, and latency you pay before the first output token; deleting it is a big reason a 12B model can field images and audio inside 16 GB rather than needing a datacenter card, and part of why Google can claim quality near its 26B mixture-of-experts model despite the smaller, simpler stack. The catch is that the backbone now has to learn visual and acoustic structure itself, with no pretrained encoder doing that work for it — which is plausibly why this ships as a 12B model trained for it from the start rather than a vision adapter glued onto an existing text model. The architectural specifics beyond the single-matmul description are not yet fully documented.
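To put rough numbers on "parameters you store", here is a back-of-the-envelope count. Every size in it is a hypothetical stand-in: the backbone width of 4096, and a 24-layer, 1024-wide vision transformer, are not Gemma's or any specific model's published dimensions. The per-layer formula is the standard approximation for a transformer block (four attention projections plus a two-matrix MLP with 4x expansion), ignoring biases and normalization weights.

```python
def projection_params(in_dim, out_dim, bias=True):
    """Weights (and optional bias) of a single linear projection."""
    return in_dim * out_dim + (out_dim if bias else 0)


def transformer_layer_params(width, mlp_ratio=4):
    """Approximate weights in one transformer block, ignoring biases and norms."""
    attention = 4 * width * width        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * width * width  # up-projection and down-projection
    return attention + mlp


PATCH_DIM = 16 * 16 * 3  # 768, from the toy example
D_MODEL = 4096           # hypothetical backbone embedding width
VIT_WIDTH = 1024         # hypothetical vision-encoder width
VIT_LAYERS = 24          # hypothetical vision-encoder depth

# Encoder-free: one projection from raw patch to backbone width.
encoder_free = projection_params(PATCH_DIM, D_MODEL)

# Encoder-based: a full vision transformer, then a projector into the backbone.
encoder_based = (VIT_LAYERS * transformer_layer_params(VIT_WIDTH)
                 + projection_params(VIT_WIDTH, D_MODEL))

print(f"encoder-free front end:  {encoder_free:,}")   # 3,149,824
print(f"encoder-based front end: {encoder_based:,}")  # 306,188,288
```

Under these assumed sizes the front end shrinks by roughly two orders of magnitude. The count says nothing about quality: the work the encoder used to do does not vanish, it moves into the backbone.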

The payoff is a cleaner idea of what "multimodal" even requires. You don't strictly need a bespoke eye and ear bolted onto a language model; if every input can be projected into the same token space, one backbone can read all of them. Gemma 4 12B is a bet that for a small, open model meant to run on modest hardware, fewer moving parts beats a heavier, more specialized stack.

Goes deeper in: LLM Internals → Embeddings → From Token IDs to Vectors


FAQ

What is encoder-free multimodal projection?

It is a way to make a language model multimodal without a separate vision or audio encoder. Instead of running an image through a dedicated network first, the model cuts it into patches and turns each patch into a token with a single matrix multiply — projecting it directly into the same embedding space as text tokens. Audio is handled the same way. One backbone then reads text, image, and audio tokens as one stream.
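A minimal sketch of the "one stream" idea, under loud assumptions: the sizes are tiny and made up, the weights are random, audio is represented as pre-cut feature frames, and the text-then-image-then-audio ordering is chosen only for illustration. The source does not describe how Gemma 4 12B frames audio or interleaves modalities.

```python
import random

D_MODEL = 4      # hypothetical embedding width, kept tiny for readability
VOCAB_SIZE = 10  # hypothetical vocabulary size
PATCH_DIM = 12   # hypothetical flattened patch length (e.g. 2x2 RGB)
FRAME_DIM = 6    # hypothetical audio frame feature length

rng = random.Random(0)


def random_matrix(rows, cols):
    """A rows x cols matrix of small random weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


embedding_table = random_matrix(VOCAB_SIZE, D_MODEL)  # text: lookup by token ID
image_weight = random_matrix(PATCH_DIM, D_MODEL)      # image: projection
audio_weight = random_matrix(FRAME_DIM, D_MODEL)      # audio: projection


def embed_text(token_ids):
    """A token ID becomes a vector by table lookup."""
    return [embedding_table[i] for i in token_ids]


def project(vectors, weight):
    """Each input vector becomes a token by one matrix multiply."""
    width = len(weight[0])
    return [[sum(x * row[j] for x, row in zip(vec, weight))
             for j in range(width)]
            for vec in vectors]


def build_stream(token_ids, patches, frames):
    """Concatenate all three modalities into one sequence of D_MODEL vectors."""
    return (embed_text(token_ids)
            + project(patches, image_weight)
            + project(frames, audio_weight))


text_ids = [1, 5, 2]
patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(2)]
frames = [[rng.random() for _ in range(FRAME_DIM)] for _ in range(3)]

stream = build_stream(text_ids, patches, frames)
print(len(stream), len(stream[0]))  # 8 4
```

Once the three lists are concatenated, nothing in the sequence marks which vectors came from text, pixels, or sound: every element has the same width, which is what lets a single backbone attend over all of them.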

Why does removing the vision encoder matter?

A separate vision encoder is extra parameters to store, extra compute to run, and extra latency before the language model produces its first token. Dropping it is a big part of why Gemma 4 12B can handle images and native audio inside about 16 GB of memory and still report quality near Google's larger 26B mixture-of-experts model. The trade-off is that the backbone has to learn visual and acoustic structure itself, which is why the design ships as a model trained for it rather than a bolt-on.
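For a sense of scale on the 16 GB figure, the arithmetic below estimates weight storage alone for a 12-billion-parameter model at a few common precisions. The source does not say which precision the 16 GB claim assumes, so the bit widths here are illustrative, and the estimate ignores activations, the attention cache, and runtime overhead.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 12_000_000_000  # "12B", taken at face value

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
# 16-bit weights: 24.0 GB
#  8-bit weights: 12.0 GB
#  4-bit weights: 6.0 GB
```

On this arithmetic, every parameter removed from the front end is memory that does not have to be budgeted at all, whatever precision the rest of the model is stored in.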

How does it relate to native multimodal models like GLM-5V?

They answer different questions. "Native vs vision-bolted" is about training: was the model multimodal from the start, or was a vision module added to a finished text model? "Encoder-free" is about architecture: is there a separate encoder network at all, or does the input get projected straight into the token space? A model can be natively trained and still use a vision encoder; Gemma 4 12B is unusual in being both natively multimodal and encoder-free.


Originally posted on Learn AI Visually.
