
"This article was originally published on my Substack."
There’s a quiet assumption baked into the way most of us use AI today: you type a prompt, it leaves your machine, travels to a data center somewhere, gets processed on hardware you don’t own, and the answer comes back. For most of the last three years, “using AI” has meant “renting AI.” Your data leaves. You hope for the best.
Gemma 4 is Google DeepMind’s clearest challenge to that model yet. Recently released under an Apache 2.0 license, it’s a family of four open-weight models. They range from a 2-billion-parameter edge model that fits on a phone to a 31-billion-parameter dense model that runs on a single consumer GPU. These aren’t research toys. The 31B variant currently ranks as the #3 open model in the world on the Arena AI text leaderboard, outcompeting models twenty times its size. The 26B model sits at #6.
Built on the same research and technology behind Gemini 3, these models handle multi-step reasoning, native function calling, code generation, and multimodal input across text, images, video, and audio. They support over 140 languages out of the box. And they do all of it on hardware you already own or could afford to.
Let’s break down what that actually means in practice.
What’s Under the Hood
Gemma 4 ships in four sizes, each designed for a different deployment scenario. Understanding the differences matters because picking the right model is the single most important decision you’ll make.
The 31B Dense model is the quality leader. Every one of its 31 billion parameters activates on every inference pass, which means maximum reasoning depth at the cost of higher compute. If you're fine-tuning for a specialized task (legal analysis, medical summarization, domain-specific code generation), this is your foundation. It fits on a single 80GB NVIDIA H100 in full bfloat16 precision, or on consumer GPUs in quantized form.
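Those hardware claims are easy to sanity-check with back-of-envelope arithmetic. The sketch below counts weight memory only; real deployments need extra headroom (often 10 to 20 percent or more) for activations and the KV cache:

```python
# Rough VRAM needed just to hold a dense model's weights, ignoring
# activations and KV cache. Uses decimal GB, matching GPU spec sheets.
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# 31B parameters in bfloat16 (16 bits): 62 GB, fits an 80 GB H100.
print(weight_memory_gb(31, 16))  # 62.0
# The same model quantized to 4 bits: 15.5 GB, fits a 24 GB RTX 4090.
print(weight_memory_gb(31, 4))   # 15.5
```

The same arithmetic explains why the 2B edge model fits comfortably in a phone's RAM once quantized: about 1 GB of weights at 4 bits.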
The 26B Mixture-of-Experts (MoE) model takes a different approach. It contains 26 billion total parameters but only activates roughly 3.8 billion of them during any given inference pass. Think of it as a team of specialists: instead of running every expert on every query, the model routes each token to the most relevant subset. This model delivers significantly accelerated token generation, matching the performance of much smaller architectures while preserving the core reasoning strengths of the dense version. It is the ideal choice when prioritizing low latency over absolute benchmark perfection.
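The routing idea behind a mixture-of-experts layer can be sketched in a few lines. This is a generic top-k gating illustration, not Gemma's actual router: the number of experts, the value of k, and the logits here are all made up.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-renormalize
    their weights so only that subset runs for this token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# One token's router scores over eight hypothetical experts:
logits = [1.2, -0.5, 0.1, 2.0, -1.0, 0.3, -0.2, 0.8]
print(top_k_route(logits))  # routes to experts 3 and 0, weights sum to 1
```

The compute savings come from the `chosen` list: only two expert networks run per token instead of eight, which is why a 26B-total-parameter model can generate tokens at the speed of a much smaller one.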
The E4B and E2B edge models are purpose-built for phones, IoT devices, and anything where RAM and battery life are constraints. These are multimodal out of the box. They handle text, images, video, and native audio input. They’re designed to run completely offline with near-zero latency on devices like the Raspberry Pi, NVIDIA Jetson Orin Nano, and Android phones. For Android developers specifically, these models are compatible with the AICore Developer Preview for forward compatibility with Gemini Nano 4.
All four models support long context windows, ranging from 128K tokens on the edge models to 256K tokens on the 26B and 31B. To put that in practical terms: 256K tokens is roughly 500 pages of text. That’s an entire codebase, a full legal contract, or a quarter’s worth of financial filings processed in a single prompt with no chunking required.
Why “Local” Is a Business Decision, Not Just a Technical One
If you’re a business owner reading this and wondering why you should care about where a model runs, the answer comes down to three things: data control, cost structure, and global reach.
Data stays on your hardware. Every time your team sends customer data, internal documents, or proprietary information to a cloud AI provider, you’re trusting a third party with that data. For industries governed by HIPAA, GDPR, SOC 2, or similar regulations, that trust comes with compliance overhead and significant risk. Running Gemma 4 locally means sensitive information never leaves your premises. There’s no API call to audit, no third-party data processing agreement to negotiate, no residual data sitting on someone else’s servers.
The cost model flips. Cloud AI pricing is usage-based: you pay per token, per request, per minute. A customer service bot handling thousands of queries a day, or a document analysis pipeline processing hundreds of contracts a week, gets expensive fast, and the bill grows with your traffic. A local deployment has a fixed hardware cost and near-zero marginal cost per inference. Once the GPU is paid for, every additional query is essentially free. For high-volume, predictable workloads, the math favors local deployment within months.
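The break-even arithmetic is simple enough to sketch. Every number below is hypothetical; plug in your own hardware price, token volume, and provider rates:

```python
def breakeven_months(hardware_cost, monthly_tokens_millions,
                     cloud_price_per_million, local_power_cost_per_month):
    """Months until a one-time GPU purchase beats per-token cloud pricing."""
    cloud_monthly = monthly_tokens_millions * cloud_price_per_million
    savings = cloud_monthly - local_power_cost_per_month
    if savings <= 0:
        return float("inf")  # low volume: cloud stays cheaper
    return hardware_cost / savings

# Hypothetical: a $2,000 GPU vs. 300M tokens/month at $2 per
# million tokens, with roughly $50/month in electricity.
print(round(breakeven_months(2000, 300, 2.0, 50), 1))  # 3.6
```

Note the `inf` branch: the same math shows why low-volume or spiky workloads are often better left in the cloud, which is the honest flip side of the argument.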
140+ languages, no translation API required. Gemma 4 is natively trained on over 140 languages. If you’re serving customers in São Paulo, Jakarta, and Berlin, you don’t need to add a translation layer or maintain separate model deployments. A single model handles multilingual input and output natively, which dramatically simplifies localization for global products.
For Developers: Agents, Not Chatbots
The most significant shift in Gemma 4 isn’t a benchmark number; it’s the native support for agentic workflows. This isn’t a model that just answers questions. It’s a model designed to use tools, call functions, produce structured JSON output, and follow multi-step plans.
In practical terms, that means you can build an agent that reads a GitHub issue, checks out the relevant branch, identifies the bug in context (thanks to the 256K window), writes a fix, runs the test suite, and opens a pull request. This is all orchestrated by the model’s own reasoning, with each step involving a structured function call to an external tool. Google has built this capability in natively, not as a wrapper or a prompt hack.
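The glue that makes such an agent work is small: the model emits a structured JSON function call instead of prose, and your code parses it and dispatches to a real tool. The tool name and arguments below are invented for illustration; they are not part of any official Gemma API.

```python
import json

# Hypothetical tool registry. In a real agent, entries like this would
# wrap git, a test runner, or the GitHub API.
def run_tests(branch: str) -> str:
    return f"tests passed on {branch}"  # stub standing in for a real runner

TOOLS = {"run_tests": run_tests}

def dispatch(model_output: str) -> str:
    """Parse a structured function call emitted by the model and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# When the model decides to act, it emits JSON rather than an answer:
result = dispatch('{"name": "run_tests", "arguments": {"branch": "fix/issue-142"}}')
print(result)  # tests passed on fix/issue-142
```

The loop then feeds `result` back to the model as context for its next step, which is how a multi-step plan (read issue, fix, test, open PR) gets executed one structured call at a time.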
For local development specifically, the 31B model is positioned as an offline coding assistant. Quantized versions run on consumer GPUs such as an RTX 4090 or RTX 5090, turning your workstation into a self-contained AI development environment with no internet dependency. Google and NVIDIA have collaborated on optimizations so that these models leverage Tensor Cores for accelerated inference out of the box, with day-one support from tools like Ollama, llama.cpp, vLLM, and LM Studio.
The hardware partnership story extends beyond NVIDIA. Google has worked with Qualcomm Technologies and MediaTek on mobile optimization, and with Arm on efficient edge deployment. The goal is to be able to run Gemma 4 anywhere you have compute.
Getting Started: The Simplest Path
There are several ways to get Gemma 4 running. Here’s the fastest.
If you want to test before committing, open Google AI Studio. The 31B and 26B MoE models are available there for immediate experimentation. No need to download, set up, or have a GPU. For the edge models, Google’s AI Edge Gallery app on Android lets you test E4B and E2B directly on your phone.
If you want to run locally, the most straightforward path is Ollama. Install it, then pull the model:
```bash
ollama pull gemma4
```
That’s it. You’re running a frontier-class model locally. If you want more control, such as quantization options, specific model variants, and GPU configuration, then download GGUF weights from Hugging Face and run them through llama.cpp.
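Once the model is pulled, your applications talk to it over Ollama's local REST API (port 11434 by default). A minimal sketch, assuming the `gemma4` tag from the pull command above and a running Ollama server:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(prompt: str, model: str = "gemma4") -> dict:
    """Non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    """Send one completion request; requires `ollama pull gemma4` first."""
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(build_request("Explain the Apache 2.0 license in one sentence."))
```

No API key, no billing account, no data leaving the machine: the request never crosses your network boundary.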
If you’re building for production, the model weights are available on Hugging Face, Kaggle, and Ollama, with day-one integration support from Hugging Face Transformers, vLLM, SGLang, NVIDIA NIM, and a long list of other frameworks. For cloud-scale deployment, Vertex AI, Cloud Run, and GKE are all supported paths on Google Cloud.
The Apache 2.0 license means there are no usage restrictions, no reporting requirements, and no commercial limitations. You can fine-tune, redistribute, and deploy without asking permission.
Three Use Cases Worth Building
The offline retail assistant. Picture a phone app that uses the E4B model to “see” products through the camera, answer customer questions about specifications, check local inventory, and suggest alternatives, all without an internet connection. In a warehouse, a retail floor, or a remote pop-up shop, this works where cloud-dependent solutions don’t.
The enterprise document agent. A 256K context window means you can feed an entire quarterly report, or several, into a single prompt. Pair that with native function calling, and you have an agent that reads the filing, extracts key metrics, compares them against last quarter’s numbers (pulled via a structured API call), flags anomalies, and drafts a summary. The entire pipeline runs on-premises, with no customer or financial data leaving your network.
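The anomaly-flagging step of that pipeline is a good example of what belongs in plain code rather than in the model: deterministic checks on the metrics the model extracted. A toy version, with invented metric names and numbers:

```python
def flag_anomalies(current: dict, previous: dict, threshold: float = 0.2):
    """Flag metrics that moved more than `threshold` (20% by default)
    quarter-over-quarter. Inputs are metric-name -> value mappings,
    as extracted by the model from the filings."""
    flags = {}
    for name, value in current.items():
        prior = previous.get(name)
        if not prior:
            continue  # new metric or zero baseline: nothing to compare
        change = (value - prior) / prior
        if abs(change) > threshold:
            flags[name] = round(change, 3)
    return flags

q2 = {"revenue": 12.4, "opex": 9.8, "churn_pct": 3.1}
q1 = {"revenue": 11.9, "opex": 7.1, "churn_pct": 3.0}
print(flag_anomalies(q2, q1))  # flags the ~38% opex jump, nothing else
```

The agent would then hand only the flagged metrics back to the model to draft the summary, keeping the arithmetic auditable and the model focused on language.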
The autonomous code reviewer. Point the 31B model at a pull request. It reads the diff in the context of the full repository (256K tokens covers a lot of code), identifies potential bugs, checks for style violations, suggests performance improvements, and posts its review, all as a local CI step that adds seconds, not minutes, to your pipeline.
Where Gemma 4 Still Falls Short
No model is perfect at everything, and intellectual honesty about limitations builds more trust than uncritical hype.
Gemma 4's largest model is 31 billion parameters. That's remarkable for its size, but for the heaviest lifting, such as frontier-level math, advanced scientific reasoning, or highly nuanced long-form writing, a larger cloud model will still do better. The faster MoE variant is a trade-off: you gain speed but give up a small amount of quality. And while the edge models are impressive, a 2B model on your phone won't replace a datacenter-scale model for critical tasks.
Local deployment also shifts operational responsibility to you. There’s no managed service handling uptime, scaling, or security patches. If you’re running Gemma 4 in production, you own the infrastructure. That’s the flip side of data sovereignty: your team needs the capacity to manage what it now controls.
The Bigger Picture
The Gemma 4 release represents something worth paying attention to beyond the model itself. Google is releasing frontier-competitive models under Apache 2.0 at the same moment some other AI labs are pulling back from open releases. That’s a strategic bet on ecosystem growth over model lock-in, and it matters for anyone building products on top of open AI infrastructure.
We’re watching a shift from “AI as a service you rent” to “AI as infrastructure you own.” While Gemma 4 won't entirely eliminate your reliance on cloud services, it represents a significant step toward that goal.
A model that ranks among the top open models in the world, runs on a consumer GPU, handles multimodal input in 140 languages, and ships under a permissive open-source license is a genuinely new thing in this space.
The question worth sitting with: if you can run this level of intelligence on your own hardware, with your data never leaving your control, what does that make possible that wasn’t before?
*Gemma 4 models are available on Hugging Face, Kaggle, and Ollama. Try them in Google AI Studio, or explore the edge models in the AI Edge Gallery on Android.*
Thanks for reading!