We at zyte-devrel try to stay plugged into what is happening in the AI and developer tooling space, not just because it is interesting, but because a lot of it starts having real implications for how we build and think about web data pipelines. Lately, one development that has had us genuinely curious is Google's new Gemma 4 model family, and specifically the direction it points toward with Mixture of Experts (MoE) architecture.
This is not a deep tutorial. It is more of a "hey, here is what we have been poking at" - the kind of update we would share in a Slack channel or over coffee. If you want to join discussions like this, our Discord is always a welcoming place.
What is Gemma 4?
Gemma models have been described as stripped-down versions of Google Gemini. Gemma 4 is Google's latest family of open-weight language models, released last week. The lineup covers four sizes:
- 2B: ultra-efficient, built for mobile and edge devices
- 4B: enhanced multimodal capabilities, still edge-deployable
- 26B: sparse model using Mixture of Experts architecture (more on this below)
- 31B: dense model for more demanding tasks
All four variants support multimodal input (text and images), over 140 languages, a 128K-256K token context window, and agentic workflows with tool use and JSON output. The 2B and 4B models are specifically designed to run fully offline on modern edge devices like smartphones, with no internet dependency at all.
According to Google's Gemma 4 model page, the family ranks third among open-weight models on the LM Arena leaderboard and uses 2.5 times fewer tokens than comparable models for equivalent tasks.
The Gemma 4 26B especially caught my attention because, unlike the other variants, it is built on a Mixture of Experts architecture, and that makes a real difference.
What is MoE, and why does it matter?
Mixture of Experts (MoE) is one of those ideas that sounds complex but is actually pretty intuitive once you hear the analogy.
In a traditional dense neural network, every parameter in the model activates for every input. It is like calling your entire company into a meeting every time someone has a question. It works, but it is expensive.
MoE works differently. Instead of one large model doing everything, you have a set of smaller "expert" sub-networks, each specialized in different patterns, plus a router that looks at each incoming token and decides which one or two experts to activate. Most of the model sits idle at any given moment.
The result: you get the quality of a much larger model at a fraction of the inference cost.
The Gemma 4 26B model is a great illustration of this. It has 26 billion total parameters, but during inference it only activates around 3.8 billion of them. You get near-26B quality at roughly 3.8B compute cost. That is the MoE advantage in one number.
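The router-plus-experts idea above is easy to sketch in code. Here is a toy, self-contained illustration of top-k routing (the experts are stand-in functions, not real sub-networks, and the top-2 choice mirrors common MoE designs rather than Gemma 4's actual internals):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy "experts": each is just a function over the token vector.
# In a real MoE layer these would be separate feed-forward networks.
experts = [
    lambda x: [v * 2 for v in x],   # expert 0
    lambda x: [v + 1 for v in x],   # expert 1
    lambda x: [-v for v in x],      # expert 2
    lambda x: [v * v for v in x],   # expert 3
]

def route(token, router_logits, top_k=2):
    """Pick the top_k experts for this token and mix their outputs,
    weighted by the router's renormalized softmax scores."""
    scores = softmax(router_logits)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        w = scores[i] / total  # renormalize over the chosen experts only
        for j, v in enumerate(experts[i](token)):
            out[j] += w * v
    return out, top
```

The key point is in `route`: only `top_k` of the experts ever execute for a given token, which is why a 26B-parameter MoE model can run at roughly 3.8B-parameter cost.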
Other models that take the same approach:
- Mixtral 8x7B: eight experts, two active per token; it outperforms Llama 2 70B on most benchmarks at far lower inference cost
- Kimi: Moonshot AI's model, also MoE-based, has been making similar waves in the open-model space
For a deep dive on how MoE works under the hood, the Hugging Face guide to mixture of experts is well worth the read.
Since the weights are free, you can host the models locally with Ollama (given the right machine) or call them through API services like OpenRouter.
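If you go the OpenRouter route, the endpoint is OpenAI-compatible, so a plain HTTP call works. A minimal sketch, with one caveat: the model slug `google/gemma-4-26b` below is a guess for illustration, so check OpenRouter's model catalog for the real identifier:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "google/gemma-4-26b"  # hypothetical slug -- verify against the OpenRouter catalog

def build_request(prompt, api_key, model=MODEL):
    """Assemble a chat-completion HTTP request in the OpenAI-compatible shape."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To actually send it:
# with urllib.request.urlopen(build_request("Summarize this page.", "sk-...")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```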
My preferred way of trying a new model is through Claude, but Gemma 4 appears to use a different tool-calling structure, so it is not compatible yet. You can use it with LM Studio instead, or skip all of that, because you can now
Run Gemma 4 offline on an iPhone
Here is the part worth sharing, because it genuinely surprised us.
Using the Google Edge AI Gallery app from the App Store, you can load a Gemma 4 model and run it with airplane mode on. No API calls, no cloud round-trips, no data leaving the device. Just the model running locally on your phone.
The experience is not going to replace a frontier model for complex reasoning. But that is not the point. For quick classification, summarization, or just experimenting with local inference, the 2B and 4B variants are remarkably capable, with zero API costs and no data leaving your device. And since the model is multimodal, you can practically point your phone camera at a paper receipt and ask it to save the details in a spreadsheet.
If you have not tried running a local large language model (LLM) yet, this is probably the lowest-friction entry point on hardware you already own.
Why should developers building data pipelines care?
Here is where it connects back to what a lot of us are building.
When LLMs run on-device or at the edge, the calculus around data pipelines shifts in a few useful directions:
Tokens are getting expensive, so when a model as good as Gemma 4 or Qwen-3.5 is free and open-weight, that is a welcome development. Over the last couple of weeks, everyone has been complaining about exhausting their Claude usage quota or getting huge bills after handing Opus API keys to OpenClaw. Open models can go a long way toward addressing that.
- No API round-trips: on-device inference eliminates latency from cloud API calls. For classification tasks running inside a scraping pipeline, this is a meaningful difference.
- Data privacy: running extraction locally means scraped content never leaves your infrastructure. For regulated industries or sensitive datasets, that is a significant advantage.
- Cost at scale: if you are doing high-volume classification — is this a product page? is this content in the target language? — running a small local model beats paying per-token at scale.
- Edge preprocessing: a small LLM can filter and classify pages before they ever reach a more expensive cloud model for deeper analysis, and I am personally looking forward to running them on SBCs like a Raspberry Pi.
- Open weights: people often confuse open-weight models with open-source models. The lines can be blurry (and even I do not fully understand every distinction), but one thing is certain: Gemma 4 ships under the Apache 2.0 license, which permits building and selling products on top of it, and having the weights means you can fine-tune the model for your own use case or application.
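The edge-preprocessing idea above is easy to wire up against a local Ollama instance. A minimal sketch, assuming Ollama is running on its default port; the model tag `gemma4:2b` is a placeholder, so substitute whatever tag your Ollama install actually lists:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def make_classifier_prompt(html_snippet):
    """Cheap yes/no gate: ask the small local model whether a page is
    worth sending on to a larger (and more expensive) cloud model."""
    return (
        "Answer with exactly YES or NO. "
        "Is the following page a product page?\n\n" + html_snippet[:2000]
    )

def is_product_page(html_snippet, model="gemma4:2b"):  # model tag is an assumption
    """Call the local Ollama generate endpoint and parse the yes/no answer."""
    payload = {
        "model": model,
        "prompt": make_classifier_prompt(html_snippet),
        "stream": False,  # get one JSON object back instead of a stream
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"].strip().upper().startswith("YES")
```

Everything here stays on your own machine: no per-token billing, no scraped content leaving your infrastructure, which is exactly the privacy and cost argument from the bullets above.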
Here's me playing with it on my iPhone 16, completely offline:
Just checking in
We do not have grand proclamations here. This is a space that is moving fast, and we are learning alongside everyone else.
If you have been experimenting with local LLMs in your scraping or data extraction workflows, we would genuinely love to hear about it. Drop a comment below, or find us on the Zyte discord and read more interesting blogs on Zyte Blog.
If you want to try this yourself, here are three good starting points:
- Google Edge AI Gallery: available on the App Store and Play Store, runs Gemma 4 locally on your phone
- Gemma models on Hugging Face: for running on desktop or server
- Google's Gemma 4 model page: full family overview, benchmarks, and architecture details
