Nathan Maine

Fine-Tuning Gemma 4 on Day Zero: 3 Bugs We Solved in 30 Minutes

Google released Gemma 4 today under Apache 2.0 — their most capable open model family. The 31B dense model scores ~1452 on LMArena with a 256K context window.

We wanted to fine-tune it immediately. QLoRA on a single NVIDIA B200. It broke three times before training started.

Here's what happened and how we fixed each one.


Bug 1: "Transformers does not recognize this architecture"

The first error hits before the model even loads:

ValueError: The checkpoint you are trying to load has model type `gemma4` 
but Transformers does not recognize this architecture.

Why: The latest stable Transformers release (5.4.0) shipped before Gemma 4 existed. The gemma4 model type only exists in the dev branch.

Fix: Install from source.

pip install git+https://github.com/huggingface/transformers.git

This installs 5.5.0.dev0, which includes the Gemma4ForConditionalGeneration class.

Time to fix: 2 minutes.


Bug 2: "Target module Gemma4ClippableLinear is not supported"

After installing Transformers from source, the model loads fine. But when PEFT tries to apply LoRA:

ValueError: Target module Gemma4ClippableLinear(
  (linear): Linear4bit(in_features=1152, out_features=1152, bias=False)
) is not supported.

Why: Gemma 4 introduces a new layer type called Gemma4ClippableLinear for its vision and audio encoders. It wraps nn.Linear with optional input/output clamping for numerical stability. The catch: it inherits from nn.Module, not nn.Linear.

PEFT checks the type of every target module before applying LoRA. Since Gemma4ClippableLinear isn't nn.Linear, PEFT rejects it — even though we only want to apply LoRA to the text decoder layers, not the vision encoder.

The exclude_modules parameter doesn't help either. PEFT runs the type check before filtering, so excluded modules still need to be recognized types.

Installing PEFT from source doesn't help, as support for this layer type simply doesn't exist yet.
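The failure mode is easy to see with a tiny torch-free stand-in (class names here are illustrative, not PEFT internals): a module that merely *wraps* a linear layer is not itself a subclass of that layer, which is exactly what the type check trips over.

```python
# Stand-ins for torch.nn.Linear and Gemma4ClippableLinear; no torch/PEFT needed.
class Linear:
    pass

class ClippableLinear:  # wraps a Linear but does not inherit from it
    def __init__(self):
        self.linear = Linear()

class PatchedClippableLinear(Linear):  # the monkey-patch approach: inherit instead
    pass

print(isinstance(ClippableLinear(), Linear))         # False -> rejected
print(isinstance(PatchedClippableLinear(), Linear))  # True  -> accepted
```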

Fix: Monkey-patch Gemma4ClippableLinear to inherit from nn.Linear before loading the model.

import torch
import torch.nn as nn
from transformers.models.gemma4 import modeling_gemma4

class PatchedClippableLinear(nn.Linear):
    """Drop-in replacement that subclasses nn.Linear so PEFT's type check passes."""

    def __init__(self, config, in_features, out_features):
        nn.Linear.__init__(self, in_features, out_features, bias=False)
        self.use_clipped_linears = getattr(config, "use_clipped_linears", False)
        if self.use_clipped_linears:
            # Buffers default to +/-inf (a no-op clamp); the real bounds are
            # loaded from the checkpoint's state dict.
            self.register_buffer("input_min", torch.tensor(-float("inf")))
            self.register_buffer("input_max", torch.tensor(float("inf")))
            self.register_buffer("output_min", torch.tensor(-float("inf")))
            self.register_buffer("output_max", torch.tensor(float("inf")))

    def forward(self, x):
        if self.use_clipped_linears:
            x = torch.clamp(x, self.input_min, self.input_max)
        out = nn.Linear.forward(self, x)
        if self.use_clipped_linears:
            out = torch.clamp(out, self.output_min, self.output_max)
        return out

modeling_gemma4.Gemma4ClippableLinear = PatchedClippableLinear

Place this before any AutoModelForCausalLM.from_pretrained() call. PEFT now sees the vision encoder layers as standard linear layers and proceeds normally.
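With the patch in place, the LoRA config can target just the text decoder's projection layers. Here is a sketch using standard PEFT options; the module names and hyperparameters are placeholders, since the post doesn't state its exact LoRA settings:

```python
from peft import LoraConfig

# Hypothetical values; tune r / alpha / target modules for your own run.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # text attention projections
    task_type="CAUSAL_LM",
)
```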

Result: 534M trainable parameters (1.68% of 31.8B total).
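That percentage checks out:

```python
# Quick arithmetic check of the reported trainable-parameter ratio.
trainable = 534e6   # 534M LoRA parameters
total = 31.8e9      # 31.8B total parameters
print(f"{trainable / total:.2%}")  # -> 1.68%
```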

Time to fix: 15 minutes (including reading the Gemma 4 source to understand the layer).


Bug 3: "mm_token_type_ids is required"

LoRA applies, data loads, training starts — and immediately crashes:

ValueError: `mm_token_type_ids` is required as a model input when training

Why: Gemma 3 required token_type_ids during training. Gemma 4 adds a second required field: mm_token_type_ids (multimodal token type IDs). The model validates their presence in the forward pass, even for text-only training. For text-only inputs, both should be all zeros.

Standard tokenizers and data collators don't produce mm_token_type_ids. You need a custom collator.

Fix: Add both fields during tokenization and build a custom data collator.

# During tokenization
def format_chat(example):
    text = tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False
    )
    tokenized = tokenizer(text, truncation=True, max_length=4096)
    tokenized["token_type_ids"] = [0] * len(tokenized["input_ids"])
    tokenized["mm_token_type_ids"] = [0] * len(tokenized["input_ids"])
    tokenized["labels"] = tokenized["input_ids"].copy()  # loss on the full sequence, prompt included
    return tokenized

# Custom data collator
from dataclasses import dataclass

import torch

@dataclass
class GemmaCollator:
    tokenizer: object
    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        pad_id = self.tokenizer.pad_token_id
        batch = {
            "input_ids": [],
            "attention_mask": [],
            "token_type_ids": [],
            "mm_token_type_ids": [],
            "labels": [],
        }
        for f in features:
            pad_len = max_len - len(f["input_ids"])
            batch["input_ids"].append(f["input_ids"] + [pad_id] * pad_len)
            batch["attention_mask"].append(
                [1] * len(f["input_ids"]) + [0] * pad_len
            )
            batch["token_type_ids"].append([0] * max_len)
            batch["mm_token_type_ids"].append([0] * max_len)
            batch["labels"].append(
                f.get("labels", f["input_ids"]) + [-100] * pad_len
            )
        return {k: torch.tensor(v) for k, v in batch.items()}
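The collator's padding semantics can be sketched in plain Python (toy token ids, no torch): real tokens keep their labels, while padded positions get the pad id in input_ids, 0 in attention_mask, and -100 in labels so the loss ignores them.

```python
# Toy batch of two sequences (ids are made up); pad_id assumed to be 0.
features = [{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9, 10, 11, 12]}]
pad_id = 0
max_len = max(len(f["input_ids"]) for f in features)

for f in features:
    pad = max_len - len(f["input_ids"])
    f["padded"] = f["input_ids"] + [pad_id] * pad
    f["mask"] = [1] * len(f["input_ids"]) + [0] * pad
    f["labels"] = f["input_ids"] + [-100] * pad  # -100 = ignored by the loss

print(features[0]["padded"])  # [5, 6, 7, 0, 0]
print(features[0]["mask"])    # [1, 1, 1, 0, 0]
print(features[0]["labels"])  # [5, 6, 7, -100, -100]
```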

Important: set remove_unused_columns=False in your training config, or the trainer will strip mm_token_type_ids before it reaches the model.

training_args = SFTConfig(
    ...,
    dataset_text_field=None,
    remove_unused_columns=False,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=GemmaCollator(tokenizer),
)

Time to fix: 5 minutes.


The Result

After all three fixes:

  • 31B model training at 4.5s/step on a single NVIDIA B200 (192GB)
  • 534M trainable parameters via QLoRA (1.68% of 31.8B)
  • GPU utilization: 89%, 38GB VRAM used
  • Estimated training time: ~7.5 hours for 3 epochs on 16K examples
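The time estimate is consistent with the step rate, assuming an effective batch size of 8 (an inference from the numbers above; the actual batch configuration isn't stated):

```python
# Back-of-envelope check: 3 epochs x 16K examples at 4.5 s/step.
examples, epochs, sec_per_step = 16_000, 3, 4.5
eff_batch = 8                             # assumed effective batch size
steps = examples * epochs // eff_batch
hours = steps * sec_per_step / 3600
print(f"{steps} steps -> {hours:.1f} h")  # -> 6000 steps -> 7.5 h
```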

Total time from "model released" to "training steps running": under 4 hours (including model download).


Key Takeaways

  1. Day-zero fine-tuning requires bleeding-edge dependencies. Install Transformers and PEFT from source when working with newly released models.

  2. Multimodal models have hidden requirements for text-only training. Both token_type_ids and mm_token_type_ids are validated even when no images or audio are involved.

  3. PEFT's type checking happens before module filtering. Even if you exclude vision modules, they still need to be recognized types. Monkey-patching is a valid workaround until official support lands.

  4. Experience alone doesn't prevent these bugs. They're day-zero discovery problems; what experience changes is how fast you solve them.


Issues filed:

  • [huggingface/peft] Gemma4ClippableLinear not supported
  • [huggingface/transformers] mm_token_type_ids required for text-only fine-tuning

Both include workarounds and suggested fixes. PRs welcome.
