Mirza Iqbal

Posted on Jun 5

Gemma 4 makes on-device multimodal AI good enough to ship

#gemma #opensource #ai #llm

Everyone will repeat the headline 12B.

What actually changes things is the 2B that fits in your pocket.

Google released the Gemma 4 family this week, and most of the coverage will fixate on the big models. I want to point at the other end of the range, because that is where the real shift lives.

Gemma 4 in one breath

Per the model card, Gemma 4 is a family, not a single model.

It runs from an E2B at 2.3B effective parameters and an E4B at 4.5B effective, up through a 12B Unified, a 26B mixture-of-experts with 3.8B active, and a 31B dense model.

A 256K token context rides on the 12B. Inputs span text, image, and audio. Output is text.

It uses an encoder-free unified architecture, the weights are open, and you can download them today.

Sit with the small end of that list for a second. A model that reads text, sees images, and hears audio, sized to run on hardware you already own.

Why "effective parameters" is the quiet story

An E2B behaves like a far larger model while its memory footprint fits a phone or a cheap laptop.

That one design choice carries the whole thing.

For a decade the unspoken rule was that real AI lives in a data center and you rent it by the token. Small Gemma 4 models chip away at that rule. Capability moves to the device, and that device is one most people already carry.

What changes for developers

This is the part I care about, because it rewrites the economics of shipping a feature.

Cost stops being per request. A cloud model bills every call forever. A local model costs a one-time download, and after that inference runs on the user's own hardware with no per-call bill.

Latency stops being a network problem. No round trip, no server-side cold start, no region to pick. Answers happen where the user is standing.

Privacy stops being a promise and becomes a property. Data never leaves the device, so the whole compliance conversation around sensitive input shifts under your feet.

Input is multimodal on the device too. You get local voice and camera understanding with no cloud speech or vision endpoint in the loop.
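To put rough numbers on the cost point, here is a minimal sketch of how a per-request bill accumulates over time. The traffic and price figures are made up for illustration and do not come from any provider's price list, and the function name is my own.

```python
def cumulative_cloud_cost(requests_per_day: int, days: int, price_per_request: float) -> float:
    """Total spend on a metered API after `days` of steady traffic."""
    return requests_per_day * days * price_per_request


if __name__ == "__main__":
    # Hypothetical figures: 50,000 requests a day at $0.002 per request.
    for days in (30, 365, 3 * 365):
        total = cumulative_cloud_cost(50_000, days, 0.002)
        print(f"after {days:>4} days: ${total:,.0f}")
```

The local side of the ledger is not literally zero, since you still pay for download bandwidth and the user pays in battery and storage, but it does not grow with every call.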

That old excuse, that a serious feature needs the cloud, got a lot weaker this week.

What changes for society

Widen the lens past the IDE.

Access is the big one. A mid-range phone becomes an AI device. That reaches people and places where metered cloud AI was never going to be affordable in the first place.

Resilience follows. Offline-first intelligence works on a train, in a clinic with bad wifi, in a region the cloud forgets.

Sensitive domains get an option they did not have. Health notes, legal documents, personal context. Things people are right to never hand to a server can now be read by a model that stays on the phone.

When the cost of running intelligence falls toward zero and the privacy cost falls with it, the set of people who get to use it gets much larger. That is the part worth being excited about.

Honest caveats, because someone has to say them

I am not going to oversell this.

A 12B is not a phone model. Call it a capable consumer-hardware model instead. True edge duty falls to the E2B and E4B tier, and even those want real RAM. A 2B-effective model is not free on a six-year-old handset.
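A back-of-the-envelope check on "real RAM": weight memory is roughly parameter count times bytes per parameter. The sketch below uses the parameter counts quoted in this post and assumes three common precisions. It also assumes the effective count is what sits in memory for the E-models and that all 26B of the mixture-of-experts stay resident. Both are my assumptions, not statements from the model card, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
# Bytes needed to store one parameter at each (assumed) precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts in billions, as quoted in this post.
# The MoE entry uses the total count, assuming every expert stays loaded.
MODELS_BILLIONS = {"E2B": 2.3, "E4B": 4.5, "12B": 12.0, "26B MoE": 26.0, "31B": 31.0}


def weight_gib(params_billion: float, precision: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    for name, billions in MODELS_BILLIONS.items():
        row = ", ".join(
            f"{p}: {weight_gib(billions, p):5.1f} GiB" for p in BYTES_PER_PARAM
        )
        print(f"{name:>8}  {row}")
```

By this estimate the E2B's weights come to roughly 1 GiB at 4 bits, while the 12B's come to roughly 22 GiB at 16 bits, which is why the 12B belongs on a laptop or desktop and not in a pocket.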

Benchmark numbers on the card are the lab's own. MMLU-Pro at 77.2, AIME 2026 at 77.5, MATH-Vision at 79.7. Treat those as a starting point and run your own eval, because your task is not their benchmark.
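Running your own eval does not need a framework to get started. Here is a minimal sketch of an exact-match check over your own prompt and answer pairs. The `generate` callable is a placeholder for whatever inference call you actually use, and `stub_model` with its canned answer exists only so the example runs. None of these names come from a Gemma API.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    generate: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of cases where the model output equals the expected answer,
    ignoring case and surrounding whitespace."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(
        1
        for prompt, expected in cases
        if generate(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; swap in your own inference code."""
    canned = {"Category for: 'refund not received'": "billing"}
    return canned.get(prompt, "unknown")


if __name__ == "__main__":
    my_cases = [
        ("Category for: 'refund not received'", "Billing"),
        ("Category for: 'app crashes on login'", "bug"),
    ]
    print(f"exact-match accuracy: {exact_match_accuracy(stub_model, my_cases):.2f}")
```

Exact match is a blunt metric. It suits classification-style tasks, and free-form output will want a looser comparison, but even twenty cases from your own product tell you more than a leaderboard row.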

On-device output here is text. It reads images and audio; it does not generate them.

A local model is not a frontier model. For the hardest reasoning you will still reach up to the big hosted systems. Local does not replace the cloud here. What it does is take a huge slice of everyday work off the cloud entirely.
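In practice that split tends to become a routing decision: local by default, cloud only when the task needs it and the data is allowed to leave. A minimal sketch follows, with field names and rules that are purely illustrative and should be replaced by your own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    contains_sensitive_data: bool   # hypothetical flag set by your own policy
    device_online: bool
    needs_frontier_reasoning: bool  # hypothetical flag, e.g. from task type


def choose_backend(req: Request) -> str:
    """Return "local" or "cloud" for a request, preferring local."""
    # Sensitive input never leaves the device, and offline leaves no choice.
    if req.contains_sensitive_data or not req.device_online:
        return "local"
    # Only the hardest tasks are escalated to a hosted model.
    if req.needs_frontier_reasoning:
        return "cloud"
    return "local"


if __name__ == "__main__":
    everyday = Request(contains_sensitive_data=False, device_online=True,
                       needs_frontier_reasoning=False)
    print(choose_backend(everyday))
```

The ordering is the design choice here: the privacy and connectivity checks run before the capability check, so a hard task on sensitive data still stays on the device.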

What I would reach for first

Features that only ever called the cloud because nothing good ran locally.

Private intake and triage where the data legally cannot leave the device. Field tools that have to work with no signal. On-device document and audio sorting for people who never wanted their files uploaded in the first place.

None of that was sensible to build on metered cloud inference. A lot of it is sensible now.

One question to leave you with

Look at one feature in your product that quietly ships user data to a model API.

Did it go to the cloud because it had to, or because a year ago nothing good enough ran on the device?

Tell me which one it is, and whether that answer still holds this week.
