Yunus Emre for Proje Defteri

Posted on • Originally published at projedefteri.com

Gemma 4: Google's Most Powerful Open Source AI Model

Hello everyone! 😁

Today we're diving into a very exciting topic. Google DeepMind just dropped a massive bomb in the open source AI world: Gemma 4 models are officially released! 🚀

You know how people keep saying "open source models are nice but they can't even compete with closed source ones"... With Gemma 4, you might want to rethink that claim. This model family delivers the most impressive intelligence-per-parameter we've ever seen.

And it comes with a full Apache 2.0 license. Completely open source and commercially available. 🎉

What is Gemma 4? 🤔

Gemma 4 is the most intelligent open source model family built on Gemini 3 research and technology by Google DeepMind. It goes far beyond simple chatbots: it has serious capabilities in complex reasoning, agentic workflows (the model autonomously using tools to complete tasks), code generation, and multimodal understanding (processing different data types like text, images, and audio together).

Since the launch of the Gemma series, developers have downloaded the models over 400 million times and created more than 100,000 variants, building a massive "Gemmaverse" ecosystem. Gemma 4 is the answer to this community's needs.

Did you know?
Gemma 4's 31B model ranks as the 3rd open source model worldwide on the Arena AI text leaderboard! The 26B MoE model holds the 6th spot, outperforming models 20 times its size. 🤯

Model Sizes and Architectures 📐

Gemma 4 comes in four different sizes, each optimized for different hardware and use cases:

| Model | Parameters | Context Window | Supported Inputs |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective (5.1B total) | 128K | Text, Image, Audio |
| Gemma 4 E4B | 4.5B effective (8B total) | 128K | Text, Image, Audio |
| Gemma 4 26B A4B (MoE) | 25.2B total / 3.8B active | 256K | Text, Image |
| Gemma 4 31B (Dense) | 30.7B | 256K | Text, Image |

E2B and E4B: On-Device Models

The "E" in the names stands for "effective". These models maximize parameter efficiency through Per-Layer Embeddings (PLE) technology. While the total parameter count is higher, the number of active parameters during inference is much lower.

This allows them to run on edge devices like phones, Raspberry Pi, and NVIDIA Jetson Nano, with no internet connection required and no network latency. 📱

Unlike their larger siblings, these smaller models also accept audio input: they can perform speech recognition (ASR) and speech translation.

26B MoE and 31B Dense: Desktop and Server Models

The larger models are designed for researchers and developers:

26B A4B (MoE): Out of 25.2 billion total parameters, only 3.8 billion are active during inference. The model contains 128 experts, and 8 are selected for each inference pass. As a result, it runs at the speed of a 4B model while delivering the quality of a 26B model.

31B Dense: The maximum quality variant with all parameters active. It provides a strong foundation for fine-tuning. Quantized versions can run even on consumer GPUs.

Info
The 31B model's bfloat16 weights fit on a single 80GB NVIDIA H100 GPU. Quantized versions can run on gaming GPUs like RTX 3090/4090!
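To see why the 26B A4B model runs at "4B speed," it helps to sketch top-k expert routing. This is a toy numpy illustration of the general MoE technique, not Gemma's actual routing code; every name and dimension here is made up for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=8):
    """Toy top-k mixture-of-experts routing.

    x       : (hidden,) one token's representation
    gate_w  : (n_experts, hidden) router weights
    experts : list of callables, one per expert
    """
    logits = gate_w @ x                       # one router score per expert
    top = np.argsort(logits)[-k:]             # pick the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the chosen experts only
    # Only the k selected experts run; the other n_experts - k stay idle.
    # That's why active parameters are a small fraction of total parameters.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
hidden, n_experts = 64, 128
gate_w = rng.normal(size=(n_experts, hidden))
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(hidden, hidden)))
           for _ in range(n_experts)]
out = moe_forward(rng.normal(size=hidden), gate_w, experts, k=8)
print(out.shape)  # (64,)
```

With 8 of 128 experts active per pass, roughly 8/128 of the expert parameters do work for any given token, which is the intuition behind the 3.8B-active figure.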

Core Capabilities 🚀

Let's take a look at what Gemma 4 brings to the table 👇🏻

Advanced Reasoning and Thinking Mode

All models feature a built-in thinking mode. The model can think step by step and formulate its plan before generating an answer. This mode makes a significant difference, especially in tasks requiring math and logic.

The AIME 2026 math benchmark results speak for themselves:

  • Gemma 4 31B: 89.2%
  • Gemma 4 26B MoE: 88.3%
  • Gemma 3 27B: 20.8% 😬

That's more than a 4x improvement over the previous generation!

Agentic Workflows and Function Calling

Gemma 4 comes with native function calling and structured JSON output support. You can use the model as an autonomous agent, having it interact with various tools and APIs.

A concrete example: show Gemma 4 a photo of a temple in Bangkok and ask it to "check the weather in this city." The model first analyzes the location in the image, then automatically generates the get_weather(city="Bangkok") call. Multimodal function calling works that naturally. ✨
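On the application side, the agent loop boils down to parsing the model's structured output and dispatching it to a real function. Here is a minimal sketch of that dispatch step, assuming the model emits a JSON call; the exact wire format Gemma 4 uses is defined by its chat template, and `get_weather` is a stub invented for this example.

```python
import json

# Hypothetical tool registry; get_weather is a stub, not a real API.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 31},
}

def run_tool_call(model_output: str):
    """Parse a JSON function call emitted by the model and dispatch it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]          # look up the requested tool
    return fn(**call["arguments"])    # call it with the model's arguments

# After analysing the temple photo, the model might emit:
model_output = '{"name": "get_weather", "arguments": {"city": "Bangkok"}}'
print(run_tool_call(model_output))  # {'city': 'Bangkok', 'temp_c': 31}
```

In a full agent loop you would feed the tool's return value back to the model as a tool message so it can compose the final answer.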

Multimodal Capabilities

Gemma 4 is not just a text processing model:

  • Image: Object detection, OCR, chart interpretation, document/PDF parsing, UI element detection, variable aspect ratio support
  • Video: Frame-by-frame video analysis (silent on larger models, with audio on smaller ones)
  • Audio: ASR and multilingual speech translation (E2B and E4B only)
  • Interleaved input: You can freely mix text and images in the same prompt

The visual token budget is also configurable (70, 140, 280, 560, 1120). Use higher budgets for detailed analysis, lower ones for speed-focused tasks.

Code Generation

Gemma 4 achieved impressive results in programming benchmarks:

  • LiveCodeBench v6: 80.0% (31B)
  • Codeforces ELO: 2150 (31B)

With these scores, it's capable enough to serve as a powerful local code assistant running on your own machine.

Multi-Language Support

Trained on over 140 languages. It doesn't just translate; it understands cultural context as well. A serious advantage for developers building multilingual applications.

Long Context Window

  • Edge models: 128K tokens
  • Larger models: 256K tokens

You can feed entire code repositories or lengthy documents to the model in a single prompt.

Architecture Innovations 🏗️

Let's look at the key architectural choices behind Gemma 4's performance.

Per-Layer Embeddings (PLE)

In standard transformers, each token receives a single embedding vector at input. PLE adds a low-dimensional conditioning vector for each decoder layer on top of this. This vector is formed by combining two signals: token identity (from an embedding lookup) and context information (learned projection of the main embeddings).

Each layer receives only the token information it needs at that moment. Since the PLE dimension is much smaller than the main hidden size, it provides significant per-layer specialization at modest parameter cost.
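The mechanism is easier to see in code. Below is a toy numpy sketch of the token-identity half of PLE (a small per-layer lookup projected up and added to the hidden state); the context-mixing signal is omitted, and all shapes are illustrative, not Gemma 4's real dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, ple_dim, n_layers = 1000, 512, 64, 12

# Standard input embedding, shared by all layers
embed = rng.normal(size=(vocab, hidden))
# One small table per decoder layer: the token-identity signal
ple_tables = rng.normal(size=(n_layers, vocab, ple_dim))
# Learned per-layer projections from ple_dim up to hidden
ple_proj = rng.normal(size=(n_layers, ple_dim, hidden)) * 0.01

def layer_input(token_id, layer, h):
    """Add the layer-specific conditioning vector to the hidden state h."""
    cond = ple_tables[layer, token_id] @ ple_proj[layer]  # (hidden,)
    return h + cond

h = embed[42]
for layer in range(n_layers):
    h = layer_input(42, layer, h)  # each layer gets its own token signal
print(h.shape)  # (512,)
```

The parameter math is the point: each per-layer table costs vocab × ple_dim, far less than another full vocab × hidden embedding, which is how PLE buys per-layer specialization cheaply.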

Shared KV Cache

The last num_kv_shared_layers layers don't compute their own key-value projections. Instead, they reuse the K and V tensors from the last non-shared layer of the same attention type (sliding or full).

This has minimal impact on quality while providing significant savings in both memory and compute, especially for long context generation and on-device usage.
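A toy sketch of the idea, under simplifying assumptions (single head, a scalar stand-in for attention, and an invented 3-of-8 sharing split): only the non-shared layers project fresh K/V, and later layers reuse the cached tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_layers, n_shared = 32, 8, 3  # last 3 layers reuse K/V (illustrative)

Wq = rng.normal(size=(n_layers, hidden, hidden))
Wk = rng.normal(size=(n_layers, hidden, hidden))
Wv = rng.normal(size=(n_layers, hidden, hidden))

def forward(x):
    kv = None
    kv_layers_computed = 0
    out = x
    for layer in range(n_layers):
        q = out @ Wq[layer]  # every layer still computes its own queries
        if layer < n_layers - n_shared:
            # Non-shared layer: project its own keys/values and cache them
            kv = (out @ Wk[layer], out @ Wv[layer])
            kv_layers_computed += 1
        k, v = kv  # shared layers skip the projections and reuse the cache
        attn = np.tanh((q * k).sum() / hidden)  # scalar stand-in for attention
        out = out + attn * v * 0.01
    return out, kv_layers_computed

_, n = forward(rng.normal(size=hidden))
print(n)  # 5 of 8 layers computed K/V; cache memory shrinks accordingly
```

At long context the savings matter most, since the KV cache normally grows linearly with both sequence length and layer count.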

Hybrid Attention

The model alternates between local sliding window attention and global full-context attention layers. Smaller models use 512-token sliding windows while larger models use 1024 tokens. The dual RoPE configuration (standard RoPE for sliding layers, proportional RoPE for global layers) further strengthens long context support.
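The difference between the two layer types comes down to the attention mask. A minimal numpy sketch (causal masking only; the RoPE details are out of scope here):

```python
import numpy as np

def attention_mask(seq_len, layer_type, window=512):
    """Boolean mask: True where a query position may attend to a key position."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to future tokens
    if layer_type == "full":
        return causal                 # global layers see the whole prefix
    return causal & (i - j < window)  # sliding layers see a local window only

m = attention_mask(2048, "sliding", window=512)
print(int(m[1500].sum()))   # 512: position 1500 sees only the last 512 tokens
full = attention_mask(2048, "full")
print(int(full[1500].sum()))  # 1501: the entire prefix, including itself
```

Because the sliding layers' cost stays bounded by the window size, only the occasional global layers pay full quadratic attention, which keeps long-context inference affordable.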

Benchmark Results 📊

Gemma 4's performance in numbers:

| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro (general knowledge) | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 (coding) | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| GPQA Diamond (science) | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| MMMU Pro (multimodal) | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| τ2-bench (agentic) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |

Significant improvements across the board from Gemma 3 to Gemma 4. The leaps in math (AIME: 20.8% → 89.2%) and coding (Codeforces: 110 → 2150) are particularly striking.

How to Use It? 🛠️

Quick Start with Transformers

The easiest way is to use the Hugging Face Transformers library:

```shell
pip install -U transformers torch accelerate
```
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-E2B-it"

# Load the model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare the prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of Turkey?"},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Set to True to enable thinking mode
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse and print the response
print(processor.parse_response(response))
```

Pipeline Usage

For a simpler approach with less code:

```python
from transformers import pipeline

pipe = pipeline("any-to-any", model="google/gemma-4-e2b-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image_url_or_file_path"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }
]

output = pipe(messages, max_new_tokens=100, return_full_text=False)
print(output[0]["generated_text"])
```

Local Inference with llama.cpp

You can run Gemma 4 as an OpenAI-compatible API server on your own machine:

```shell
# macOS
brew install llama.cpp

# Windows
winget install llama.cpp

# Start the server
llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M
```

You can use this server with local agent tools like hermes, openclaw, pi, and open code.
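Since the server speaks the OpenAI chat-completions protocol, any HTTP client works against it. A minimal stdlib sketch, assuming llama-server's default port 8080 (the model name field is ignored by a single-model server):

```python
import json
import urllib.request

def build_request(prompt: str,
                  url: str = "http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-compatible chat-completions request for llama-server."""
    payload = {
        "model": "gemma-4-26b-a4b-it",  # ignored by a single-model server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Summarize the Gemma 4 model family in one sentence.")
print(req.full_url)

# To actually call the server once it is running:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the official `openai` Python client also works if you point its `base_url` at the local server.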

Ollama

The quickest way to get started:

```shell
ollama run gemma4
```

MLX (Apple Silicon)

Full multimodal support for Apple Silicon users with mlx-vlm:

```shell
pip install -U mlx-vlm

python -m mlx_vlm.generate \
  --model google/gemma-4-E4B-it \
  --image image.jpg \
  --prompt "Describe this image in detail"
```

Tip
With mlx-vlm's TurboQuant feature, you can achieve the same accuracy as the uncompressed model while using ~4x less active memory. Long context inference is now much more practical on Apple Silicon!

Fine-Tuning 🎛️

Gemma 4 also provides a strong foundation for fine-tuning.

Fine-Tuning with TRL

The TRL library now supports multimodal tool responses. This means the model can receive not just text but also images from tools during training.

A great example: Gemma 4 learning to drive in the CARLA simulator. The model sees the road through a camera, makes decisions, and learns from the outcomes. After training, it successfully learns to change lanes to avoid pedestrians! 🚗

```shell
pip install git+https://github.com/huggingface/trl.git

python examples/scripts/openenv/carla_vlm_gemma.py \
    --env-urls https://sergiopaniego-carla-env.hf.space \
    --model google/gemma-4-E2B-it
```

Unsloth Studio

For those who prefer a visual interface for fine-tuning:

```shell
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Windows
irm https://unsloth.ai/install.ps1 | iex

# Launch
unsloth studio -H 0.0.0.0 -p 8888
```

Vertex AI

Scalable fine-tuning is also possible on Google Cloud with Vertex AI Serverless Training Jobs. You can set up CUDA-powered training with custom Docker containers.

Apache 2.0 License ⚖️

This is perhaps one of the most important details. Gemma 4 is released under the Apache 2.0 license:

  • ✅ Commercial use is freely permitted
  • ✅ You can modify and create your own versions
  • ✅ Full control over your data, infrastructure, and models
  • ✅ Deploy anywhere you want, on-premise or cloud

Some previous "open" models came with restrictive licenses. Gemma 4 shipping with Apache 2.0 shows it's a truly free model.

Clément Delangue, Hugging Face CEO:
"The release of Gemma 4 under an Apache 2.0 license is a huge milestone. We are incredibly excited to support the Gemma 4 family on Hugging Face on day one."

Safety and Ethics 🛡️

Gemma 4 goes through the same safety protocols as Google's proprietary models:

  • CSAM filtering (against child exploitation content) applied
  • Personal and sensitive data filtering implemented
  • Content filtered in accordance with Google's AI policies for quality and safety

Safety tests showed significant improvements across all categories compared to previous Gemma models.

Where to Download? 📥

You can download Gemma 4 models from platforms like Hugging Face and Ollama.

If you want to try it right away, you can test the 31B and 26B models directly from your browser on Google AI Studio, or try the E4B and E2B models on Google AI Edge Gallery.

Conclusion

Gemma 4 is a serious step forward in the open source AI space. With its record-breaking performance per parameter, Apache 2.0 license, wide hardware support from edge devices to servers, and multimodal capabilities, it's a very powerful tool for developers.

If you've been wondering how to use open source LLMs in your projects or want to set up your own local AI server, Gemma 4 is a model family you should definitely evaluate.

What do you think? Are you planning to try Gemma 4? Which size fits your use case? Let's discuss in the comments! 👇🏻

Happy coding! 😊


⚠️ AI-Generated Content Notice

This blog post is entirely generated by artificial intelligence. While AI enables content creation, it may still contain errors or biases. Please verify any critical information before relying on it.

Your support means a lot! ✨ Comment 💬, like 👍, and follow 🚀 for future posts!
