VIKAS

Posted on May 19

Gemma 4 Is the First Open Model I'd Actually Recommend to a Client

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

A client asked me three months ago whether they could build an AI feature into their SaaS product without depending on OpenAI.

Their concerns were reasonable. Vendor lock-in. Data leaving their servers. Cost unpredictability at scale. The risk of a pricing change breaking their unit economics overnight.

Six months ago, my honest answer would have been: "Not really. The open models are close, but not close enough for a production product recommendation."

After spending a week with Gemma 4, my answer has changed. This is the first open model family where I'd say yes without caveats.

Here's why, and what you need to know to actually use it.

The Questions That Actually Matter for Product Builders

Most Gemma 4 coverage asks: "Is it as good as GPT-4o?"

That's the wrong question if you're building products.

The questions that matter are:

Can I ship it commercially without legal risk?
Will it run on hardware my client already owns or can afford?
Is the quality actually good enough for the specific task I need?

For the first time with an open model family, Gemma 4 answers all three with a clear yes. Let me walk through each.

Question 1: Can I Ship It Commercially?

Every previous Gemma release came with Google's custom Terms of Use. You could experiment, you could research, but commercial deployment required reading fine print and accepting restrictions that most small teams didn't bother to fully understand.

Gemma 4 is Apache 2.0.

That means: use it commercially, fine-tune it, redistribute it, build products on it, charge money for those products. No MAU limits. No usage restrictions. No negotiating access agreements. The same license as Kubernetes, TensorFlow, and a large share of the infrastructure software your product already runs on.

For a client project, this is the difference between "interesting prototype" and "production feature." Apache 2.0 removes the legal conversation entirely.

Question 2: Will It Run on Available Hardware?

This is where Gemma 4's architecture actually shines, and it requires understanding what Google shipped.

Gemma 4 is not one model. It's four, each designed for a different hardware reality.

E2B — For Edge and Mobile

The "E" stands for effective parameters. Using Per-Layer Embeddings (PLE), the model runs at approximately 2B effective parameters during inference despite the weight file being larger. It runs fully offline on phones, Raspberry Pi, and Jetson Nano devices. Google's benchmarks show it hitting roughly 7.6 tokens per second on a Raspberry Pi 5.

Hardware requirement at 4-bit: ~5 GB RAM
Best for: Mobile apps, edge deployments, offline-first features, anything battery or privacy sensitive
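
To put the 7.6 tokens per second figure in product terms, here is a rough latency estimate. The 300-token reply length is my own assumed example, and the calculation ignores prompt processing time, so treat it as a lower bound rather than a measurement.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimate decode time only; prompt processing is not included."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second

# 7.6 tok/s is the Raspberry Pi 5 figure quoted above.
# 300 tokens is an assumed reply length, roughly a few paragraphs.
pi5_seconds = generation_seconds(300, 7.6)
print(f"{pi5_seconds:.1f} s")  # prints "39.5 s"
```

Around 40 seconds per answer is workable for background summarization but too slow for interactive chat, which is worth knowing before you promise a client an on-device assistant.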

E4B — For Most Laptops

Same "E" naming, around 4.5B effective parameters at runtime. This is the entry point for most developers. Runs comfortably on a laptop with 8-12 GB RAM, supports audio and image input, and delivers noticeably better reasoning than E2B.

Hardware requirement at 4-bit: ~8 GB RAM
Best for: Developer laptops, local coding assistants, document processing, any standard dev machine

26B A4B — The Sweet Spot for Desktops and Servers

This one deserves a longer explanation because the naming trips people up constantly.

The model has 26 billion total parameters. But it's a Mixture of Experts (MoE) architecture: for each token, 8 of 128 experts plus 1 shared expert activate. The "A" in A4B stands for active parameters during inference, about 3.8 billion.

What this means practically: you get quality close to a 26B dense model at roughly the inference speed of a 4B model.
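
If the routing idea is new to you, here is a toy sketch of top-k expert selection. It is not Gemma 4's actual router, which I haven't seen described in detail. It assumes the common scheme of a softmax over the selected logits, and it uses scalar "experts" as stand-ins for real feed-forward blocks. Only the 8-of-128 plus one shared expert numbers come from the model description above.

```python
import math
import random

NUM_EXPERTS = 128   # routed experts, per the description above
TOP_K = 8           # experts activated per token, per the description above

def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Return {expert_index: weight} for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}

def moe_layer(x: float, router_logits, experts, shared_expert) -> float:
    """Add the always-on shared expert to the weighted outputs of the routed experts."""
    weights = route_token(router_logits)
    routed = sum(w * experts[i](x) for i, w in weights.items())
    return shared_expert(x) + routed

# Toy experts: each one just scales its input (hypothetical stand-ins).
rng = random.Random(0)
experts = [lambda x, s=rng.uniform(0.5, 1.5): s * x for _ in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.1 * x
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

weights = route_token(logits)
print(len(weights), "of", NUM_EXPERTS, "experts ran for this token")
print(moe_layer(1.0, logits, experts, shared_expert))
```

The other 120 experts do no work for this token, which is where the speed comes from. They still have to exist in memory, because the next token may route to any of them.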

On the LMArena leaderboard, it scores 1441 Elo. The 31B Dense scores 1452. An 11-point gap that is invisible in most real-world tasks.
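
To see how small an 11-point gap is, you can convert it to an expected head-to-head preference rate. This sketch assumes the leaderboard scores behave like standard Elo ratings with the usual 400-point logistic scale; the leaderboard's exact fitting method may differ, so read the result as an approximation.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# 31B Dense vs 26B A4B, using the scores quoted above.
p = elo_expected_score(1452, 1441)
print(f"31B preferred in about {p:.1%} of head-to-head comparisons")
```

Under that assumption the 31B wins roughly 51.6% of pairwise votes, which is close to a coin flip.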

One thing that trips people up: all 26 billion parameters still need to load into memory for fast expert routing. You cannot skip the memory cost just because inference is fast. Budget ~18 GB at 4-bit, not 4 GB.

Hardware requirement at 4-bit: ~18 GB RAM
Best for: Desktop workstations, local AI servers, coding agents, RAG pipelines, any machine with enough memory for the ~18 GB footprint
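
The arithmetic behind that warning is simple enough to sketch. This computes the raw size of the weights only. The gap between the raw number and the ~18 GB figure above is, I assume, taken up by the KV cache, runtime buffers, and tensors kept at higher precision; I have not measured the breakdown.

```python
def weight_gb(total_params: int, bits_per_weight: int) -> float:
    """Raw size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 26_000_000_000    # every expert must be resident in memory
ACTIVE_PARAMS = 3_800_000_000    # parameters actually used per token

print(weight_gb(TOTAL_PARAMS, 4))   # 13.0 -> what has to be loaded
print(weight_gb(ACTIVE_PARAMS, 4))  # 1.9  -> what per-token compute touches
```

Memory is sized by the first number, speed by the second. Budgeting from the active parameter count is the mistake to avoid.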

31B Dense — Maximum Quality

Every token activates all 31 billion parameters. Slower, but the quality leader. Currently ranked #3 on the Arena AI open model text leaderboard, with the 26B A4B at #6, both outcompeting models 20x their size. Also the cleanest choice for fine-tuning because the dense architecture gives predictable gradient flow during training.

Hardware requirement at 4-bit: ~20 GB RAM (RTX 3090 or 4090 handles this)
Best for: Fine-tuning, maximum quality requirements, cloud GPU deployments

The Quick Decision Table

Your Hardware	Pick This
Phone / Raspberry Pi / embedded device	E2B
Laptop, 8-12 GB RAM	E4B
Desktop or Apple Silicon Mac, 16+ GB	26B A4B
Workstation / Cloud GPU, 24 GB+ VRAM	31B or 26B A4B
Fine-tuning a custom model	31B
Android app via AICore Developer Preview	E2B or E4B

Practical rule: start one size smaller than you think you need. A model that runs is more useful than one that swap-thrashes or crashes on load.
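
If you want that rule as something you can drop into a setup script, here is a minimal sketch. The footprint numbers are the approximate 4-bit figures from the sections above; the 2 GB headroom for the OS and other processes is my own assumption, so adjust it for your machine.

```python
# Approximate 4-bit memory figures quoted in this article, smallest first.
FOOTPRINT_GB = [("E2B", 5), ("E4B", 8), ("26B A4B", 18), ("31B", 20)]

def pick_model(memory_gb: float, headroom_gb: float = 2.0):
    """Return the largest variant that fits after reserving headroom, or None."""
    budget = memory_gb - headroom_gb
    fitting = [name for name, need in FOOTPRINT_GB if need <= budget]
    return fitting[-1] if fitting else None

for ram in (4, 8, 16, 24):
    print(ram, "GB ->", pick_model(ram))
```

With these numbers a 16 GB machine lands on E4B, which is more conservative than the table. Treat the table as the optimistic case and this function as the "one size smaller" rule applied literally.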

Question 3: Is the Quality Good Enough?

For the specific tasks most product builders need, yes.

Gemma 4 made two architectural improvements that matter for real usage, not just benchmarks.

Long context that actually works. Previous open models advertised long context but fell apart in practice when information was buried deep in a document. Google's benchmark numbers show the 31B went from 13.5% to 66.4% on multi-needle retrieval tests. That is a functional change, not an incremental one. Passing a full codebase or a long contract in a single prompt now works in ways it didn't before.

Native function calling. Tool definitions go directly into the model via apply_chat_template. No prompt engineering tricks to force JSON output. No wrapper libraries to parse responses. The model returns structured tool calls that you parse and execute. This is what makes local agentic workflows practical.

One thing to keep in mind: the training data cutoff is January 2025. For tasks requiring current information, pair it with RAG or a web search layer. This isn't a weakness unique to Gemma 4. Every model has a training cutoff, and a local model has no built-in way around it, so it's worth planning for.
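
For anyone who hasn't built a RAG layer before, the core idea fits in a few lines. This is a deliberately crude sketch: it ranks passages by shared words instead of using embeddings or a vector store, and the knowledge-base snippets are invented for illustration. A real pipeline would swap in a proper retriever, but the prompt assembly step looks much the same.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by words shared with the query and keep the top k."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend retrieved passages so the model answers from supplied context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge-base snippets, invented for illustration.
docs = [
    "Pricing changed in March: the Pro plan is now billed per seat.",
    "Password resets are handled from the admin panel under Users.",
    "The mobile app supports offline mode on Android and iOS.",
]
print(build_prompt("How is the Pro plan billed?", docs))
```

The resulting string goes in as the user message. The model never needs to "know" the pricing change, it only needs to read it.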

Running It: The Complete Setup

Via Ollama (Fastest Path)

# Install Ollama from ollama.com, then:
ollama pull gemma4:e4b     # ~8 GB
ollama pull gemma4:26b     # ~16 GB
ollama pull gemma4:31b     # ~20 GB

# Start chatting
ollama run gemma4:e4b

Critical step people skip: Ollama defaults to a 4,096-token context window regardless of what the model supports. It silently truncates from the beginning when the limit is hit: no error, no warning, no indication anything went wrong. If the model seems to "forget" earlier parts of a conversation, this is why.

Set it explicitly at runtime from inside the interactive session:

ollama run gemma4:e4b
>>> /set parameter num_ctx 32768

Or in the API:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "gemma4:e4b",
    "options": { "num_ctx": 32768 },
    "messages": [
      { "role": "user", "content": "Your prompt here" }
    ]
  }'

To make a permanent preset with your preferred context size, save the next two lines as a file named Modelfile, then run the two commands that follow:

FROM gemma4:e4b
PARAMETER num_ctx 32768

ollama create gemma4-e4b-32k -f Modelfile
ollama run gemma4-e4b-32k
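
Because the truncation is silent, I find it useful to check on the client side before sending. This is a rough pre-flight sketch: the 4 characters per token ratio is a common heuristic and not Gemma's real tokenizer, and the 1,024-token output reserve is an assumed default. Both helper names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4 chars/token ratio is a heuristic, not Gemma's tokenizer."""
    return int(len(text) / chars_per_token) + 1

def fits_context(messages: list[dict], num_ctx: int, reserve_for_output: int = 1024) -> bool:
    """True if the estimated prompt plus reserved output space fits in num_ctx."""
    prompt_tokens = sum(estimate_tokens(m["content"]) for m in messages)
    return prompt_tokens + reserve_for_output <= num_ctx

history = [{"role": "user", "content": "x" * 40_000}]  # about 10k estimated tokens
print(fits_context(history, 4096))    # False: this would be truncated at the default
print(fits_context(history, 32768))   # True
```

When the check fails, either raise num_ctx or trim the history yourself, so you decide what gets dropped instead of the runtime.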

Integrating with a Node.js / MERN Project

Ollama exposes an OpenAI-compatible API. If you have an existing project using the OpenAI SDK, the change is two lines:

import OpenAI from "openai";

// Before: const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// After:
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required by the SDK, ignored by Ollama
});

async function reviewCode(code) {
  const response = await client.chat.completions.create({
    model: "gemma4:26b",
    messages: [
      {
        role: "system",
        content: "You are a senior backend developer. Review code for bugs, security issues, and performance problems. Be specific and concise."
      },
      {
        role: "user",
        content: `Review this code:\n\n\`\`\`js\n${code}\n\`\`\``
      }
    ],
    max_tokens: 800,
  });

  return response.choices[0].message.content;
}

This works because Gemma 4 has native system prompt support. In Gemma 3, the system role needed workarounds and was handled inconsistently. Now it works exactly as the OpenAI API spec defines it.

Function Calling (The Agentic Part)

For tool use, the correct approach is AutoProcessor with apply_chat_template, passing tools as an argument. This is documented in Google's official function calling guide at ai.google.dev.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

MODEL_ID = "google/gemma-4-E4B-it"

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    device_map="auto",
    dtype=torch.bfloat16  # recent transformers releases prefer dtype over torch_dtype
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Define your tools as JSON schema
tools = [
    {
        "name": "search_docs",
        "description": "Search internal documentation for an answer",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                },
                "max_results": {
                    "type": "integer",
                    "description": "Number of results to return"
                }
            },
            "required": ["query"]
        }
    }
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How do I reset my password in the admin panel?"}
        ]
    }
]

# Pass tools directly to apply_chat_template
inputs = processor.apply_chat_template(
    messages,
    tools=tools,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

input_len = inputs["input_ids"].shape[-1]
output = model.generate(**inputs, max_new_tokens=200)

response = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(response)
# Returns a structured tool call you parse, execute, then pass the result back
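
The "parse, execute, pass the result back" step can be sketched like this. I'm assuming the model's output has already been parsed into a dict with name and arguments keys; the exact text format Gemma 4 emits, and the message shape its template expects for tool results, are not covered here, so check them against the official guide. The search_docs body is a hypothetical stand-in for a real search.

```python
import json

def search_docs(query: str, max_results: int = 3) -> list[str]:
    """Hypothetical stand-in for a real documentation search."""
    return [f"result {i + 1} for '{query}'" for i in range(max_results)]

# Only functions listed here can be invoked by the model.
TOOL_REGISTRY = {"search_docs": search_docs}

def execute_tool_call(call: dict) -> dict:
    """Run one parsed tool call and return a tool-role message for the next turn."""
    name = call.get("name")
    if name not in TOOL_REGISTRY:
        raise ValueError(f"model requested unknown tool: {name!r}")
    result = TOOL_REGISTRY[name](**call.get("arguments", {}))
    # Assumed message shape; adjust to whatever the chat template expects.
    return {"role": "tool", "name": name, "content": json.dumps(result)}

# Assumed shape of a call after parsing the model's output.
parsed_call = {"name": "search_docs", "arguments": {"query": "reset password", "max_results": 2}}
tool_message = execute_tool_call(parsed_call)
print(tool_message["content"])
```

The explicit registry is the important design choice: the model can only trigger functions you have listed, and an unknown name fails loudly instead of being executed.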

For basic text inference with the pipeline API, Gemma 4 uses the "any-to-any" task type:

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model="google/gemma-4-E4B-it",
    device_map="auto",
    dtype="auto"
)

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Summarize this in 3 bullet points: [your text here]"}]
    }
]

output = pipe(messages, max_new_tokens=300)
print(output[0]["generated_text"][-1]["content"])

Note: the task type is "any-to-any", not "text-generation". Using the wrong task type is a common mistake when adapting older Gemma 3 code.

What I'd Actually Use Each Model For in Client Projects

Based on a week of real testing, here's how I'd map these to actual product use cases:

E2B / E4B: Offline-first mobile features. Smart reply suggestions. On-device document summarization where the user's data cannot leave the device. HIPAA-adjacent use cases where cloud AI is simply not an option.

26B A4B: This is the model I'd use for most backend AI features. A support ticket classifier. An internal knowledge base Q&A system. A code review step in a CI pipeline. A document analysis feature for a B2B SaaS tool. It's fast enough for interactive use, good enough for most tasks, and runs on hardware a small company can own outright.

31B: Fine-tuning for a specific domain. A legal document analyzer trained on a client's specific contract types. A customer support model trained on a company's historical tickets. Cases where you need the base model quality before fine-tuning and where inference latency is less critical than output quality.

The Honest "When Not To Use It" Section

Gemma 4 doesn't replace every cloud AI use case, and claiming otherwise would be overselling it.

If your team has no GPU infrastructure and no plans to build any, Google AI Studio gives you hosted access to the 31B and 26B A4B with zero setup. Start there.

If you need current information beyond January 2025, plan for RAG from day one. The model itself won't have it.

If you're building something that needs the absolute frontier of reasoning capability, GPT-4o and Claude still lead on the hardest tasks. Gemma 4 is exceptionally strong for its size, but the top closed models still have an edge on the most complex multi-step reasoning.

Where Gemma 4 wins is the space between "toy project" and "frontier reasoning." That space covers the majority of real product features most development teams are actually building.

Where the Open Model Story Is Heading

Gemma 4's real significance isn't any single benchmark number. It's that the capability-to-hardware ratio has crossed a threshold.

A model that runs on a 16 GB desktop, handles 256K context that actually retrieves correctly, does structured tool calling natively, processes images and audio, supports 140+ languages, and ships under Apache 2.0 — that's a different product decision than anything that existed a year ago.

The 400 million Gemma downloads across all versions, and the 100,000+ community-built variants that followed, suggest developers were waiting for exactly this. A model good enough to build on, free enough to ship, small enough to run.

Gemma 4 is that model. For a lot of product builders, this is the point where "local AI" stops being a research interest and starts being a line item in a project estimate.

Start Here

No setup: Google AI Studio — hosted 31B and 26B A4B, free to start
Local CLI: Ollama — ollama run gemma4:e4b and you're running
Local GUI: LM Studio — point and click, no terminal needed
Direct weights: Hugging Face — google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, google/gemma-4-E2B-it

If you're starting today: pull gemma4:e4b on Ollama, set num_ctx to at least 8192, and test it on a task you actually need solved. That's more useful than reading another benchmark comparison.

What are you building with Gemma 4? Drop your use case in the comments. Especially interested in anyone using E2B or E4B for on-device features.

Top comments (2)

thehwang • May 19

The calibration angle resonates; saw this firsthand when I gave it a truncated transcript and it pushed back instead of confidently summarizing. Most models at this parameter class wouldn't have caught it.

VIKAS • May 20

Yep, that calibration behavior stood out immediately. Refusing to confidently summarize incomplete context is surprisingly rare in open models right now.