This is a submission for the Gemma 4 Challenge: Write About Gemma 4
When Google dropped Gemma 4 in April 2025, my first reaction was skepticism. We have seen "open" models before that came with so many restrictions they were barely usable. Then I read the license: Apache 2.0. Download it, fine-tune it, ship it commercially, no strings attached. That got my attention.
I spent a weekend running experiments with it and want to share what I found, especially for developers who are trying to figure out which model in the Gemma 4 family is worth their time.
What Is Gemma 4, Actually?
Gemma 4 is Google DeepMind's fourth generation of open-weight models. It comes in four sizes: E2B, E4B, 12B, and 27B (the 27B is actually a 26B Mixture of Experts model). The E prefix stands for Edge, meaning those two smaller models are built specifically to run on phones and laptops.
A few things make this generation genuinely different from what came before:
Native multimodality. Every model in the family handles text and images. The two edge models (E2B and E4B) also understand audio natively. This is not a plugin or an adapter bolted on after training. It is baked in.
Context window. The small models support 128K tokens. The medium models go up to 256K. For reference, 256K tokens is longer than most novels.
Thinking mode. All four models have configurable reasoning, similar to chain-of-thought but built into the instruction-tuned variants. You can toggle how much "thinking" the model does before answering.
Function calling. Built in from the start, not added later. This matters a lot if you are building agents.
The 27B model uses a Mixture of Experts architecture, which means it has ~26 billion parameters total but only activates about 3.8 billion per token. In practice, it runs more like a 4B model in terms of compute cost while retaining the knowledge capacity of something much larger.
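To make the "runs like a 4B model" claim concrete, here is a small arithmetic sketch using only the approximate figures quoted above. The function name is made up for illustration.

```python
# Approximate figures quoted in the paragraph above.
TOTAL_PARAMS_B = 26.0   # total parameters, in billions
ACTIVE_PARAMS_B = 3.8   # parameters activated per token, in billions

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the parameters that take part in one token's forward pass."""
    return active_b / total_b

frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"Active per token: {frac:.1%} of all parameters")
```

Keep in mind that this only describes compute. With a Mixture of Experts model, all of the weights still have to be loaded, so the memory footprint follows the total parameter count, not the active one.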
Choosing the Right Variant
Here is how I think about which model to reach for:
| Model | Best For |
|---|---|
| E2B | Mobile apps, edge devices, fast batch inference |
| E4B | On-device with richer capability, local dev |
| 12B | Most fine-tuning tasks, single GPU research |
| 27B (MoE) | Production, complex reasoning, agentic workflows |
For my experiment, I used E4B because it is the sweet spot for fine-tuning on consumer hardware. It is small enough to load with 4-bit quantization on a 16GB GPU but capable enough to actually learn something useful.
The Experiment: Teaching It to Read Emotions
I fine-tuned Gemma 4 E4B-it on the dair-ai/emotion dataset from Hugging Face. The task: classify text into one of six emotions (sadness, joy, love, anger, fear, surprise).
This is a classic NLP task that sounds trivial but is actually tricky. Emotions are subtle. "I cannot believe this happened" could be joy, anger, or shock depending on context.
Setup
I used Google Colab with a T4 GPU, 4-bit NF4 quantization via bitsandbytes, and LoRA for efficient fine-tuning. The dependencies:
pip install transformers accelerate datasets trl peft bitsandbytes scikit-learn
Why 4-bit Quantization?
Loading a 4B parameter model in full 16-bit precision requires around 8GB of VRAM just for the weights, before accounting for activations, optimizer states, and gradients during training. On a T4 with 16GB total, that leaves very little room to actually train anything.
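The 8GB figure is just parameter count times bytes per parameter. A quick sketch of that arithmetic (the helper name is hypothetical, and it covers weights only, not activations, gradients, or optimizer states):

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 4e9  # a 4B parameter model
for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(params, bits):.1f} GB")
```

The raw 4-bit number is a lower bound. Real quantized loads come out somewhat higher because quantization constants are stored alongside the weights and some layers are typically kept in higher precision.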
4-bit NF4 (Normal Float 4) quantization compresses the weights down to roughly 2.5GB. The "Normal Float" format is designed specifically for neural network weight distributions, which tend to follow a bell curve rather than a uniform range. This makes NF4 more accurate than plain int4 quantization at the same bit width. During the forward and backward pass, weights are temporarily dequantized to a 16-bit compute dtype (bfloat16 where the GPU supports it, float16 on a T4, which lacks native bfloat16 support), so you get the memory savings of 4-bit storage with most of the numerical precision of 16-bit math.
The tradeoff is a small accuracy penalty compared to full precision, but in practice for fine-tuning on a focused task like this, the difference is negligible.
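For reference, a 4-bit NF4 setup with bitsandbytes usually looks something like the fragment below. This is a sketch, not necessarily the exact configuration from my notebook: the float16 compute dtype is an assumption that suits a T4, and double quantization is an optional extra I am including for illustration.

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # Normal Float 4 instead of plain 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # assumed: float16 math, suitable for a T4
    bnb_4bit_use_double_quant=True,        # optional: also quantize the quantization constants
)
```

The config object is then passed as `quantization_config=bnb_config` when calling `from_pretrained` on the model class.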
Why LoRA?
Full fine-tuning updates every weight in the model. For a 4B parameter model, that means holding gradients and optimizer states for all 4 billion weights (Adam-style optimizers keep two extra values per weight), which is simply not feasible on one consumer GPU.
LoRA (Low-Rank Adaptation) takes a different approach. Instead of updating the original weights, it freezes the entire base model and adds small trainable matrices alongside specific layers. These matrices are low-rank, meaning they capture the most important directions of change without needing to represent the full weight space. During training, only these adapter weights are updated, which is typically less than 1% of the total parameters.
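To see where the savings come from, compare the trainable parameters of one linear layer under both approaches. This is a sketch with a made-up layer size (4096 x 4096), not a dimension taken from Gemma 4.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full, lora) trainable parameter counts for one linear layer.

    Full fine-tuning trains the whole d_out x d_in weight matrix.
    LoRA trains two small matrices instead: B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

# Hypothetical layer size, for illustration only.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The ratio shrinks further as layers get wider, because the full count grows with the product of the dimensions while the LoRA count grows with their sum.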
The key insight is that fine-tuning a model for a specific task does not require changing every weight. Most of the base knowledge stays intact. You are teaching the model a narrow new skill, not retraining it from scratch.
After training, the LoRA adapter can either be kept separate (and loaded on top of the base model at inference time) or merged permanently into the base weights. Keeping it separate is useful when you want to serve multiple fine-tuned versions of the same base model without storing full model copies.
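The two options give the same outputs. Here is a toy, dependency-free sketch showing that, assuming the standard LoRA update of the form W + (alpha / r) * B @ A. All values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]

# Toy values (hypothetical): a 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
B = [[0.5], [-0.5]]            # d_out x r
A = [[1.0, 2.0]]               # r x d_in
alpha, r = 2, 1
scale = alpha / r

x = [[3.0], [4.0]]             # one input, as a column vector

# Adapter kept separate: base output plus the scaled low-rank correction.
separate = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged: fold scale * B @ A into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(separate, merged)
```

In peft, the merge step is exposed as `merge_and_unload()` on the adapted model. Merging removes the small extra matmul at inference time, at the cost of no longer being able to swap adapters on a shared base.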
Data Formatting
Gemma 4 expects a chat format: system message, user message, assistant response. I wrapped each emotion example like this:
    SYSTEM_PROMPT = """You are an emotion classification assistant.
    Read the user's text and answer with exactly one label.
    Only choose from: sadness, joy, love, anger, fear, surprise.
    Return only the label and nothing else."""

    # Label ids 0-5 in dair-ai/emotion map to these names, in this order.
    label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

    def to_prompt_completion(example):
        """Convert one dataset row into a chat-style prompt/completion pair."""
        text = example["text"]
        label = label_names[example["label"]]
        return {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Classify the emotion:\n\n{text}"},
            ],
            "completion": [
                {"role": "assistant", "content": label}
            ],
        }
The system prompt is doing real work here. Being explicit about the output format ("return only the label and nothing else") is what prevents the model from responding with full sentences like "The emotion expressed here is joy." That verbosity is a natural tendency for instruction-tuned models and it breaks downstream parsing. A precise system prompt is easier to maintain than a pile of post-processing heuristics.
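A strict output check makes that failure mode measurable. The helper below is a hypothetical sketch (the name `parse_label` and the exact normalization rules are my assumptions): anything that is not exactly one of the six labels counts as invalid.

```python
from typing import Optional

VALID_LABELS = ("sadness", "joy", "love", "anger", "fear", "surprise")

def parse_label(raw_output: str) -> Optional[str]:
    """Return the label if the output is exactly one valid label, else None."""
    # Tolerate surrounding whitespace, casing, and a trailing period, nothing more.
    cleaned = raw_output.strip().lower().rstrip(".")
    return cleaned if cleaned in VALID_LABELS else None

print(parse_label(" Joy\n"))
print(parse_label("The emotion expressed here is joy."))
```

Being strict here is deliberate. Pulling "joy" out of a full sentence would hide exactly the behavior the system prompt is supposed to prevent.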
LoRA Config
I kept the rank at 16 and applied it to all linear layers. This adds only a small fraction of new trainable parameters while keeping the base model frozen.
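    from peft import LoraConfig

    lora_config = LoraConfig(
        r=16,                         # rank of the adapter matrices
        lora_alpha=32,                # scaling factor (alpha / r = 2)
        lora_dropout=0.05,
        bias="none",                  # leave bias terms frozen
        task_type="CAUSAL_LM",
        target_modules="all-linear",  # attention and MLP projections
    )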
Training
The training configuration used gradient accumulation to simulate a larger effective batch size without exceeding memory limits, gradient checkpointing to trade compute for memory during backprop, and an 8-bit paged optimizer to keep optimizer states from consuming the remaining VRAM.
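As a rough illustration, those three choices map onto standard trainer arguments like this. It is a sketch using TRL's `SFTConfig`. The output path, batch size, accumulation steps, and learning rate are placeholder assumptions, not the exact values from my run.

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="gemma4-emotion-lora",  # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=2,     # assumed: small enough for a 16GB GPU
    gradient_accumulation_steps=8,     # assumed: effective batch size = 2 * 8 = 16
    gradient_checkpointing=True,       # recompute activations in backprop to save memory
    optim="paged_adamw_8bit",          # 8-bit paged optimizer from bitsandbytes
    learning_rate=2e-4,                # assumed: a common starting point for LoRA
    logging_steps=10,
)
```

Gradient accumulation only changes how often the optimizer steps, so peak memory is set by the per-device batch size rather than the effective one.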
I also ran a baseline evaluation on the test set before training, so I had a fair comparison point rather than relying on intuition about how much the model improved.
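For anyone reproducing the comparison, here is a dependency-free sketch of the two metrics. It assumes invalid model outputs are represented as `None` and simply count as wrong. The sample data is made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up example: None marks an output that was not a valid label.
y_true = ["joy", "joy", "fear", "anger"]
y_pred = ["joy", "fear", "fear", None]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels=["joy", "fear", "anger"])
invalid = sum(p is None for p in y_pred)
print(acc, round(f1, 3), invalid)
```

Macro F1 weights every label equally, which matters on a dataset where some emotions are much rarer than others. With scikit-learn installed, `accuracy_score` and `f1_score(..., average="macro")` compute the same quantities, provided invalid outputs are first mapped to a placeholder string.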
Results
| Metric | Before Fine-Tuning | After Fine-Tuning |
|---|---|---|
| Accuracy | 58.25% | 91.5% |
| Macro F1 | 0.421 | 0.893 |
| Invalid predictions | 33 | 2 |
The jump from ~58% to ~91.5% accuracy happened in one epoch, in under 10 minutes on a T4, training on just 4,000 examples. The model also stopped producing invalid outputs almost entirely, which suggests it picked up the task constraints, not just the label patterns.
Before fine-tuning, the model would sometimes return things like "The emotion in this text is fear" instead of just "fear". After training, it returned clean single-word labels in all but 2 cases.
What Surprised Me
The model was already decent out of the box. 58% accuracy before any fine-tuning on a 6-class classification task is well above random chance (uniform guessing over six labels would score about 16.7%). The instruction-tuned model had some understanding of the task without any task-specific training.
LoRA on all-linear works really well here. I tried targeting only attention layers first and got worse results. Applying LoRA to all linear layers, including the MLP blocks, made a significant difference for a classification-style task.
4-bit quantization held up. I was worried the quantization would hurt fine-tuning quality, but the final model performed well. NF4 is designed to handle the weight distribution of transformer models better than plain int4, and these results are consistent with that, although I did not run an int4 comparison.
Should You Actually Use Gemma 4?
If you need a quick API call and do not care about control or cost at scale, a hosted model is probably easier. But if any of the following apply to your situation, Gemma 4 is worth a serious look:
Data privacy matters. The model runs on your hardware. Nothing leaves your environment.
You want to fine-tune. Apache 2.0 means no legal grey area. You own the fine-tuned weights.
You are building on the edge. The E2B fits in about 1.3GB quantized. That is phone territory, and it handles vision and audio.
You need a long context window. 256K tokens at the medium size is genuinely useful for document processing, long code analysis, or retrieval-augmented setups.
The thing I keep coming back to is that the Hugging Face team said they "struggled to find good fine-tuning examples because the models are so good out of the box." That is a strange problem to have, but it reflects something real. These models come in already competent. Fine-tuning gets you from competent to excellent on your specific task, and the cost to do that is now low enough that it is worth trying.
Running It Yourself
The full notebook I used is available here:
It runs end to end on a free Colab T4. You need a Hugging Face account, a token with access to the gated model weights, and about 30 minutes.
The notebook auto-detects your GPU and adjusts batch size accordingly, so it should work whether you are on a T4, a 3090, or an A100.
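The idea behind that kind of auto-detection can be sketched as a simple lookup on total GPU memory. The function name, thresholds, and batch sizes below are illustrative assumptions, not the notebook's exact logic.

```python
def pick_batch_size(total_vram_bytes: int) -> int:
    """Pick a per-device batch size from total GPU memory (thresholds are assumptions)."""
    vram_gib = total_vram_bytes / 1024**3
    if vram_gib >= 32:   # e.g. an A100
        return 16
    if vram_gib >= 20:   # e.g. a 24GB 3090
        return 8
    return 4             # e.g. a 16GB T4

GIB = 1024**3
for name, size in [("T4", 16 * GIB), ("3090", 24 * GIB), ("A100", 80 * GIB)]:
    print(name, pick_batch_size(size))
```

With PyTorch, the input can come from `torch.cuda.get_device_properties(0).total_memory`, which reports the device's total memory in bytes.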
Final Thought
Open-weight models at this capability level change the math on a lot of projects. Tasks that used to require a proprietary API, a vendor relationship, and ongoing costs can now run locally, be tuned for your data, and be owned entirely by you. Gemma 4 is the clearest example of that shift I have seen so far.
Give it a try. The worst case is you spend a weekend learning something.
Have questions about the fine-tuning setup or want to compare results on a different dataset? Drop them in the comments.
