ArshTechPro
Gemma 4: A Practical Guide for Developers

Google DeepMind released Gemma 4 on April 2, 2026. It is their most capable open model family to date, built from the same research behind Gemini 3, and shipped under the Apache 2.0 license. That means no usage caps, no restrictive policies, and full commercial freedom.

This article breaks down what Gemma 4 is, what it can do, and how to actually run it in your projects. No fluff. Just the parts that matter if you are building something.


What is Gemma 4

Gemma 4 is a family of open-weight multimodal models designed for reasoning, code generation, and agentic workflows. It comes in four sizes:

| Model | Parameters | Context Window | Best For |
| --- | --- | --- | --- |
| E2B | 2.3B effective (5.1B total) | 128K tokens | Phones, Raspberry Pi, IoT |
| E4B | 4.5B effective (8B total) | 128K tokens | Edge devices, fast inference |
| 26B A4B (MoE) | 26B total, 4B active | 256K tokens | Low-latency server inference |
| 31B (Dense) | 31B | 256K tokens | Maximum quality, fine-tuning base |

Each size comes in both a base variant and an instruction-tuned (IT) variant. For most developer use cases, you want the IT variant.

The "E" prefix on the smaller models stands for "effective parameters." These models use a technique called Per-Layer Embeddings (PLE) that feeds a secondary embedding signal into every decoder layer, which means the model activates fewer parameters at inference time, saving RAM and battery.

The 26B model is a Mixture of Experts (MoE) architecture. It has 26 billion total parameters but only activates about 3.8 billion during inference. This makes it fast while still scoring near the top of the Arena AI leaderboard.
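The reason an MoE model can be both large and fast is that a router sends each token through only a few experts per layer. Here is a dependency-free sketch of top-k routing (the expert functions, scores, and k value are purely illustrative, not the model's actual configuration):

```python
def moe_layer(x, experts, router_scores, k=2):
    """Route the input through only the top-k experts; the rest stay idle."""
    top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
    total = sum(router_scores[i] for i in top)
    # Weighted mix of the chosen experts' outputs, normalized over the top-k.
    return sum(router_scores[i] / total * experts[i](x) for i in top)

# Four toy "experts" (each just scales its input); only two run per token.
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 0.5, 0.1, 0.3]   # router picks experts 1 and 3
y = moe_layer(10.0, experts, scores, k=2)
```

Half of the experts never execute for this token, which is why total parameter count and inference cost diverge so sharply in an MoE.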


What It Can Do

Gemma 4 is not just a text chatbot. Here is what the model family supports out of the box:

Text generation and reasoning. Multi-step planning, deep logic, math. The 31B model scores 85.2% on MMLU Pro and 80.0% on LiveCodeBench v6.

Vision. All four model sizes accept image and video input. The vision encoder supports variable aspect ratios and configurable token budgets (70, 140, 280, 560, or 1120 tokens per image). More tokens means more detail at the cost of more compute.

Audio. The E2B and E4B models accept audio input natively. They handle speech recognition and speech-to-translated-text across multiple languages.

Code generation. All models can generate, complete, and correct code. The 31B model is strong enough to function as an offline code assistant.

Function calling. Native support for structured JSON output, function-calling syntax, and system instructions. This is the foundation for building agents.

140+ languages. Pre-trained on over 140 languages with strong support for 35+.


Step 1: Pick Your Model

Start by deciding which model fits your hardware and use case.

If you are running on a phone, Raspberry Pi, or Jetson Nano: Use gemma-4-E2B-it or gemma-4-E4B-it. These are designed for edge devices and run fully offline with low latency.

If you have a single GPU (A100 or H100): Use gemma-4-26B-A4B-it. The MoE model fits in one GPU and gives you excellent latency because it only activates 4B parameters per forward pass.

If you have two GPUs or want maximum quality: Use gemma-4-31B-it. This is the dense model. It needs tensor parallelism across two 80GB GPUs for full bfloat16 inference, but quantized versions run on consumer GPUs.

If you just want to try it out first: Open Google AI Studio at aistudio.google.com and select the Gemma 4 model. No setup required.


Step 2: Install Dependencies

Gemma 4 requires transformers version 5.5.0 or later. Install the core packages:

pip install -U transformers torch accelerate

If you plan to work with images, also install timm:

pip install -U timm

If you want 4-bit quantization to run larger models on smaller GPUs:

pip install bitsandbytes

Step 3: Run Inference with Transformers

The fastest way to get started is with the Hugging Face pipeline API.

Text-only generation

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model="google/gemma-4-E2B-it",
    device_map="auto",
    dtype="auto",
)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Explain dependency injection in three sentences."}],
    },
]

output = pipe(messages, return_full_text=False)
print(output[0]["generated_text"])

Image + text (vision)

from transformers import pipeline

pipe = pipeline(
    task="any-to-any",
    model="google/gemma-4-E4B-it",
    device_map="auto",
    dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/your-image.jpg"},
            {"type": "text", "text": "Describe what you see in this image."},
        ],
    },
]

output = pipe(messages, return_full_text=False)
print(output[0]["generated_text"])

Lower-level control with AutoModel

If you need more control over generation parameters, load the model and processor directly:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Write a Python function that reverses a linked list."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))

Step 4: Enable Thinking Mode

Gemma 4 supports chain-of-thought reasoning. When enabled, the model outputs its internal reasoning before the final answer.

To turn it on, include the <|think|> token at the start of your system prompt:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "<|think|>You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "What is 127 * 43?"}],
    },
]

The model will output a thinking block followed by the final answer. If you are using the processor.parse_response() method, you can separate the thinking from the content automatically.

To disable thinking, simply remove the <|think|> token from the system prompt.
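If you are handling raw output text yourself instead of using processor.parse_response(), separating the reasoning from the answer is plain string handling. A sketch, assuming the reasoning arrives wrapped in <|think|> ... <|/think|> markers (treat both delimiters as placeholders; the exact tokens depend on the chat template):

```python
def split_thinking(text, open_tag="<|think|>", close_tag="<|/think|>"):
    """Split model output into (thinking, answer); thinking is None if absent."""
    if open_tag in text and close_tag in text:
        before, rest = text.split(open_tag, 1)
        thinking, after = rest.split(close_tag, 1)
        return thinking.strip(), (before + after).strip()
    return None, text.strip()

# Simulated model output for the 127 * 43 prompt above.
sample = "<|think|>127 * 43 = 127*40 + 127*3 = 5080 + 381<|/think|>127 * 43 = 5461."
thinking, answer = split_thinking(sample)
```

This lets you log or hide the reasoning trace while showing users only the final answer.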


Step 5: Serve It with vLLM

For production workloads, you will want to serve Gemma 4 behind an OpenAI-compatible API using vLLM.

Install vLLM

uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
pip install transformers==5.5.0

Start the server

For the 26B MoE on a single A100/H100:

vllm serve google/gemma-4-26B-A4B-it \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90

For the 31B dense model on two GPUs:

vllm serve google/gemma-4-31B-it \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90

For the E4B edge model:

vllm serve google/gemma-4-E4B-it \
    --max-model-len 131072

Query it

Once the server is running, hit it with a standard OpenAI-compatible request:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "google/gemma-4-26B-A4B-it",
        "messages": [
            {"role": "user", "content": "Explain quantum entanglement in simple terms."}
        ],
        "max_tokens": 512,
        "temperature": 0.7
    }'

This means you can swap Gemma 4 into any application that already talks to an OpenAI-compatible API. No code changes beyond the model name and endpoint URL.
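The same request can be built from Python with nothing but the standard library. A sketch that mirrors the curl example (it assumes the vLLM server from the previous step is running on localhost:8000, so the actual network call is left commented out):

```python
import json
import urllib.request

# The same body the curl example sends to the OpenAI-compatible endpoint.
payload = {
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
        {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

In practice you would use the official openai client package pointed at base_url="http://localhost:8000/v1", which is exactly the drop-in swap described above.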


Step 6: Run It Locally with Ollama

If you want to run Gemma 4 on your laptop without any server setup:

ollama run gemma4

That is it. Ollama handles downloading the quantized weights, setting up the runtime, and exposing a local API. This is the easiest path for local development and testing.


Step 7: Fine-Tune for Your Use Case

Gemma 4 is strong out of the box, but fine-tuning lets you specialize it for your domain. The recommended approach is QLoRA through the TRL library.

Install fine-tuning dependencies

pip install trl peft datasets bitsandbytes

Load with 4-bit quantization

from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
import torch

model_id = "google/gemma-4-E2B"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

From here, you attach LoRA adapters using PEFT, prepare your dataset, and train with TRL's SFTTrainer. The E2B model can be fine-tuned on a free Google Colab T4 GPU. The larger models need proportionally more memory.
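Independent of PEFT's API, the idea behind those LoRA adapters is simple: the original weight matrix W stays frozen, and you train two small matrices A (r × d) and B (d_out × r) whose low-rank product is added to the output. A dependency-free sketch of the forward pass (toy 2x2 dimensions, illustrative values):

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = W @ x + (alpha / r) * B @ (A @ x); W is frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
A = [[0.1, 0.0], [0.0, 0.1]]   # r x d, randomly initialized in real LoRA
B = [[0.0, 0.0], [0.0, 0.0]]   # d_out x r, zero-initialized
y = lora_forward(W, A, B, [1.0, 2.0])
```

Because B starts at zero, training begins from exactly the base model's behavior; QLoRA applies the same trick on top of the 4-bit quantized weights loaded above.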

You can also fine-tune on Vertex AI or with Unsloth for additional optimizations.


Function Calling

Gemma 4 supports native function calling, which is what makes it useful for building agents. The model can output structured JSON that specifies which function to call and with what arguments.

Here is the general pattern:

  1. Define your available functions in the system prompt as a JSON schema.
  2. Send the user's message.
  3. The model responds with a function call in structured JSON.
  4. You execute the function and return the result.
  5. The model uses the result to generate its final answer.

This works across all four model sizes. Combined with the long context windows (up to 256K tokens), you can pass entire codebases or document collections alongside your tool definitions in a single prompt.
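The loop above can be sketched in a few lines. Everything here is hypothetical for illustration: the get_weather tool, the JSON shape of the model's reply, and the way the result is passed back all depend on how you define the schema in your system prompt.

```python
import json

# Hypothetical tool that you described to the model in the system prompt.
def get_weather(city):
    return {"city": city, "temp_c": 21, "sky": "clear"}  # stubbed result

TOOLS = {"get_weather": get_weather}

# Step 3: the model replies with a structured call.
# This string stands in for real model output.
model_reply = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

# Step 4: parse the call, dispatch it, and execute the function.
call = json.loads(model_reply)
result = TOOLS[call["name"]](**call["arguments"])

# Step 5: the result goes back to the model as a new message,
# and the model uses it to write the final answer.
tool_message = {"role": "user", "content": json.dumps({"tool_result": result})}
```

A production agent wraps this loop with validation (unknown function names, malformed JSON) and a cap on the number of tool-call rounds.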


Where to Get the Models

All model weights are available for download:

  • Hugging Face: huggingface.co/collections/google/gemma-4
  • Kaggle: kaggle.com/models/google/gemma-4
  • Ollama: ollama.com/library/gemma4
  • Google AI Studio (browser): aistudio.google.com

The Hugging Face model IDs you will use most often:

  • google/gemma-4-E2B-it (smallest, edge)
  • google/gemma-4-E4B-it (small, edge)
  • google/gemma-4-26B-A4B-it (MoE, fast server inference)
  • google/gemma-4-31B-it (dense, maximum quality)

Key Architecture Details (if you care)

A few things worth knowing about how Gemma 4 works under the hood:

Alternating attention. Layers alternate between local sliding-window attention (512-1024 tokens) and global full-context attention. This is how it stays efficient while still handling long context.
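The difference between the two layer types can be pictured as a mask rule. A minimal sketch (the window size and the local/global schedule are illustrative, not the model's actual configuration):

```python
def can_attend(i, j, layer_is_global, window=512):
    """Causal attention rule: may token i look at token j?"""
    if j > i:
        return False          # never attend to the future
    if layer_is_global:
        return True           # global layers see the entire prefix
    return i - j < window     # local layers see only a sliding window

# Token 2000 in a local layer cannot reach token 100...
assert not can_attend(2000, 100, layer_is_global=False)
# ...but a global layer can, which is how long-range links survive.
assert can_attend(2000, 100, layer_is_global=True)
```

Since local layers dominate the stack, most of the KV cache only ever needs to cover the window, which is where the long-context memory savings come from.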

Dual RoPE. Standard rotary position embeddings for sliding-window layers, proportional RoPE for global layers. This is what enables the 256K context window without quality degradation at long distances.

Shared KV cache. The last N layers reuse key-value tensors from earlier layers instead of computing their own. This cuts both memory and compute during inference.

Vision encoder. Learned 2D position encoder with multidimensional RoPE. Preserves original aspect ratios. Token budgets are configurable from 70 to 1120 tokens per image.

Audio encoder. USM-style conformer architecture (same as Gemma-3n). Handles speech recognition and translation with up to 30 seconds of audio on the smaller models.


What Changed from Gemma 3

If you have used Gemma 3 before, here is what is different:

  • License. Gemma 3 used a custom Google license with restrictions. Gemma 4 uses Apache 2.0. This is a significant change for commercial use.
  • MoE model. The 26B A4B is the first Mixture of Experts model in the Gemma family.
  • Per-Layer Embeddings. The E2B and E4B models use PLE for better parameter efficiency.
  • Shared KV cache. New efficiency optimization not present in Gemma 3.
  • Audio input. The E2B and E4B models handle audio natively. Gemma 3 did not.
  • Roles. Gemma 4 uses standard system, user, and assistant roles in chat templates. Gemma 3 had a different role structure.

Summary

Gemma 4 gives you a complete open model stack: four sizes covering everything from phones to multi-GPU servers, multimodal input (text, image, video, audio), native function calling for agents, up to 256K context, and an Apache 2.0 license that lets you ship products without restrictions.

The fastest path from zero to running code:

  1. pip install -U transformers torch
  2. Load google/gemma-4-E2B-it with the pipeline API
  3. Start prompting
