Google released Gemma 4 today under Apache 2.0 — their most capable open model family. The 31B dense model scores ~1452 on LMArena with a 256K context window.
We wanted to fine-tune it immediately. QLoRA on a single NVIDIA B200. It broke three times before training started.
Here's what happened and how we fixed each one.
Bug 1: "Transformers does not recognize this architecture"
The first error hits before the model even loads:
```
ValueError: The checkpoint you are trying to load has model type `gemma4`
but Transformers does not recognize this architecture.
```
Why: The latest stable Transformers release (5.4.0) shipped before Gemma 4 existed. The gemma4 model type only exists in the dev branch.
Fix: Install from source.
```bash
pip install git+https://github.com/huggingface/transformers.git
```
This gets you 5.5.0.dev0 which includes the Gemma4ForConditionalGeneration class.
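Since the fix depends on a dev build, it can help to assert the version up front instead of hitting the opaque `ValueError` mid-script. A minimal sketch; the `(5, 5)` threshold is an assumption based on the dev version above:

```python
# Fail fast if the installed Transformers is too old for the gemma4
# architecture. The (5, 5) cutoff is assumed from the 5.5.0.dev0 build.
def supports_gemma4(version: str) -> bool:
    # Dev builds look like "5.5.0.dev0"; compare the (major, minor) prefix.
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (5, 5)

assert supports_gemma4("5.5.0.dev0")
assert not supports_gemma4("5.4.0")
```

Drop this at the top of the training script with `transformers.__version__` as the argument.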
Time to fix: 2 minutes.
Bug 2: "Target module Gemma4ClippableLinear is not supported"
After installing Transformers from source, the model loads fine. But when PEFT tries to apply LoRA:
```
ValueError: Target module Gemma4ClippableLinear(
  (linear): Linear4bit(in_features=1152, out_features=1152, bias=False)
) is not supported.
```
Why: Gemma 4 introduces a new layer type called Gemma4ClippableLinear for its vision and audio encoders. It wraps nn.Linear with optional input/output clamping for numerical stability. The catch: it inherits from nn.Module, not nn.Linear.
PEFT checks the type of every target module before applying LoRA. Since Gemma4ClippableLinear isn't nn.Linear, PEFT rejects it — even though we only want to apply LoRA to the text decoder layers, not the vision encoder.
The exclude_modules parameter doesn't help either. PEFT runs the type check before filtering, so excluded modules still need to be recognized types.
Installing PEFT from source doesn't help: support for this layer type simply doesn't exist yet.
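The rejection boils down to an `isinstance` gate. A torch-free sketch of the logic, with stand-in class names (these are illustrative, not PEFT's actual internals):

```python
class Linear:
    """Stand-in for torch.nn.Linear."""

class ClippableLinear:
    """Stand-in for Gemma4ClippableLinear: wraps a Linear instead of subclassing it."""
    def __init__(self):
        self.linear = Linear()

class SubclassedLinear(Linear):
    """A genuine Linear subclass, which is what the monkey-patch produces."""

def peft_accepts(module) -> bool:
    # Simplified version of PEFT's type gate: only recognized base types pass.
    return isinstance(module, Linear)

assert not peft_accepts(ClippableLinear())   # rejected -> ValueError in real PEFT
assert peft_accepts(SubclassedLinear())      # accepted
```

Composition hides the inner `Linear` from the type check entirely, which is why the wrapper has to become a subclass.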
Fix: Monkey-patch Gemma4ClippableLinear to inherit from nn.Linear before loading the model.
```python
import torch
import torch.nn as nn
from transformers.models.gemma4 import modeling_gemma4


class PatchedClippableLinear(nn.Linear):
    def __init__(self, config, in_features, out_features):
        nn.Linear.__init__(self, in_features, out_features, bias=False)
        self.use_clipped_linears = getattr(config, "use_clipped_linears", False)
        if self.use_clipped_linears:
            self.register_buffer("input_min", torch.tensor(-float("inf")))
            self.register_buffer("input_max", torch.tensor(float("inf")))
            self.register_buffer("output_min", torch.tensor(-float("inf")))
            self.register_buffer("output_max", torch.tensor(float("inf")))

    def forward(self, x):
        # Preserve the original layer's behavior: clamp input and output
        # only when clipping is enabled.
        if self.use_clipped_linears:
            x = torch.clamp(x, self.input_min, self.input_max)
        out = nn.Linear.forward(self, x)
        if self.use_clipped_linears:
            out = torch.clamp(out, self.output_min, self.output_max)
        return out


# Must run before the model is instantiated.
modeling_gemma4.Gemma4ClippableLinear = PatchedClippableLinear
```
Place this before any AutoModelForCausalLM.from_pretrained() call. PEFT now sees the vision encoder layers as standard linear layers and proceeds normally.
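With the patch in place, an ordinary LoRA config targets the text decoder projections by name. A sketch only: the `r`/`alpha` values, module names, and the `vision_tower`/`audio_tower` exclusion names are assumptions, not the exact run config.

```python
from peft import LoraConfig

# Illustrative config; hyperparameters and module names are assumed,
# not taken from the actual training run.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    exclude_modules=["vision_tower", "audio_tower"],  # hypothetical encoder names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

Note that `exclude_modules` now behaves as intended, since the type check that previously ran before filtering no longer rejects the patched layers.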
Result: 534M trainable parameters (1.68% of 31.8B total).
Time to fix: 15 minutes (including reading the Gemma 4 source to understand the layer).
Bug 3: "mm_token_type_ids is required"
LoRA applies, data loads, training starts — and immediately crashes:
```
ValueError: `mm_token_type_ids` is required as a model input when training
```
Why: Gemma 3 required token_type_ids during training. Gemma 4 adds a second required field: mm_token_type_ids (multimodal token type IDs). The model validates their presence in the forward pass, even for text-only training. For text-only inputs, both should be all zeros.
Standard tokenizers and data collators don't produce mm_token_type_ids. You need a custom collator.
Fix: Add both fields during tokenization and build a custom data collator.
```python
# During tokenization
def format_chat(example):
    text = tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False
    )
    tokenized = tokenizer(text, truncation=True, max_length=4096)
    tokenized["token_type_ids"] = [0] * len(tokenized["input_ids"])
    tokenized["mm_token_type_ids"] = [0] * len(tokenized["input_ids"])
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
```
```python
# Custom data collator
from dataclasses import dataclass

import torch


@dataclass
class GemmaCollator:
    tokenizer: object

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        pad_id = self.tokenizer.pad_token_id
        batch = {
            "input_ids": [],
            "attention_mask": [],
            "token_type_ids": [],
            "mm_token_type_ids": [],
            "labels": [],
        }
        for f in features:
            pad_len = max_len - len(f["input_ids"])
            batch["input_ids"].append(f["input_ids"] + [pad_id] * pad_len)
            batch["attention_mask"].append(
                [1] * len(f["input_ids"]) + [0] * pad_len
            )
            batch["token_type_ids"].append([0] * max_len)
            batch["mm_token_type_ids"].append([0] * max_len)
            batch["labels"].append(
                f.get("labels", f["input_ids"]) + [-100] * pad_len
            )
        return {k: torch.tensor(v) for k, v in batch.items()}
```
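The padding arithmetic is the easy part to get off by one, so here is the same logic checked on a toy batch, torch-free (the token ids and `pad_id` are made up for the sketch):

```python
# Toy batch exercising the collator's padding/masking arithmetic.
features = [{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9]}]
max_len = max(len(f["input_ids"]) for f in features)  # 3
pad_id = 0  # assumed pad token id for this sketch

padded, masks, labels = [], [], []
for f in features:
    pad_len = max_len - len(f["input_ids"])
    padded.append(f["input_ids"] + [pad_id] * pad_len)
    masks.append([1] * len(f["input_ids"]) + [0] * pad_len)
    # Pad labels with -100 so the loss ignores padding positions.
    labels.append(f.get("labels", f["input_ids"]) + [-100] * pad_len)

assert padded == [[5, 6, 7], [8, 9, 0]]
assert masks == [[1, 1, 1], [1, 1, 0]]
assert labels == [[5, 6, 7], [8, 9, -100]]
```

The `-100` label padding matters: anything else would make the loss attend to pad tokens.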
Important: set remove_unused_columns=False in your training config, or the trainer will strip mm_token_type_ids before it reaches the model.
```python
training_args = SFTConfig(
    ...,
    dataset_text_field=None,
    remove_unused_columns=False,
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=GemmaCollator(tokenizer),
)
```
Time to fix: 5 minutes.
The Result
After all three fixes:
- 31B model training at 4.5s/step on a single NVIDIA B200 (192GB)
- 534M trainable parameters via QLoRA (1.68% of 31.8B)
- GPU utilization: 89%, 38GB VRAM used
- Estimated training time: ~7.5 hours for 3 epochs on 16K examples
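The numbers above are internally consistent, which is a quick sanity check worth doing on any run estimate. The effective batch size of 8 is an assumption inferred from the other figures, not taken from the config:

```python
# Back-of-the-envelope check of the 7.5-hour estimate.
# eff_batch = 8 is assumed, not stated in the training config.
examples, epochs, eff_batch, sec_per_step = 16_000, 3, 8, 4.5
steps = examples * epochs // eff_batch
hours = steps * sec_per_step / 3600
assert steps == 6000
assert hours == 7.5

# Trainable-parameter fraction: 534M of 31.8B total.
assert round(534e6 / 31.8e9 * 100, 2) == 1.68
```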
Total time from "model released" to "training steps running": under 4 hours (including model download).
Key Takeaways
Day-zero fine-tuning requires bleeding-edge dependencies. Install Transformers and PEFT from source when working with newly released models.
Multimodal models have hidden requirements for text-only training. Both `token_type_ids` and `mm_token_type_ids` are validated even when no images or audio are involved.
PEFT's type checking happens before module filtering. Even if you exclude vision modules, they still need to be recognized types. Monkey-patching is a valid workaround until official support lands.
None of these bugs are ones experience lets you avoid; they're day-zero discovery problems. What experience changes is how fast you solve them.
Issues filed:
- [huggingface/peft] Gemma4ClippableLinear not supported
- [huggingface/transformers] mm_token_type_ids required for text-only fine-tuning
Both include workarounds and suggested fixes. PRs welcome.