Orbit Websites
Fine-Tuning Gemma 4 with Cloud Run Jobs: Unlocking Serverless GPU Power with NVIDIA RTX 6000 Pro for Pet Breed Classification

Serverless GPUs are no longer science fiction. With Google Cloud Run Jobs now supporting GPU-attached containers—including the powerful NVIDIA RTX 6000 Ada Generation—we can fine-tune large language models (LLMs) like Gemma 4 in a fully managed, scalable, and cost-efficient way.

In this article, I’ll walk you through my experience fine-tuning Gemma 4 (2B and 7B variants) on pet breed classification using Cloud Run Jobs + RTX 6000 Pro, and more importantly, highlight the common mistakes, gotchas, and non-obvious insights that nearly derailed my project.

Spoiler: It’s not as plug-and-play as the docs suggest.


Why This Stack?

Before diving in: Why use Gemma 4 for a pet breed classification task?

Because modern LLMs aren’t just for text generation. With instruction tuning, you can repurpose them for structured classification via prompt engineering or fine-tuning. For example:

```
Input: "Classify this pet image: [IMG]. Possible breeds: Golden Retriever, Poodle, Beagle."
Output: "Golden Retriever"
```

This approach beats traditional vision models when you have limited labeled data but need semantic reasoning (e.g., distinguishing between “Shih Tzu” and “Maltese” based on subtle traits).
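For fine-tuning, each labeled example gets serialized into that same prompt/answer shape. A minimal sketch of the conversion (the prompt template, breed list, and helper name are illustrative assumptions, not from a specific library):

```python
# Hypothetical helper: turn a labeled example into an instruction-tuning pair.
BREEDS = ["Golden Retriever", "Poodle", "Beagle"]

def to_training_pair(image_ref: str, label: str) -> dict:
    """Build a prompt/completion record for instruction fine-tuning."""
    prompt = (
        f"Classify this pet image: {image_ref}. "
        f"Possible breeds: {', '.join(BREEDS)}."
    )
    return {"prompt": prompt, "completion": label}

pair = to_training_pair("[IMG]", "Golden Retriever")
```

Run over your labeled dataset, this gives you the JSONL-style records most fine-tuning pipelines expect.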

And Cloud Run Jobs? They offer:

  • No node management
  • Automatic scaling
  • Pay-per-use billing
  • GPU attachment (RTX 6000 Ada = 48GB VRAM, 18,176 CUDA cores)

Perfect for short-lived, high-intensity training jobs.
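Wiring this up looks roughly like the following. Treat it as a sketch: the project, image, region, and resource values are placeholders, and the GPU flags have moved between `gcloud` release tracks, so check `gcloud run jobs create --help` for your version.

```shell
# Sketch: create and execute a GPU-backed Cloud Run job (flag availability
# varies by gcloud release; values below are placeholders).
gcloud beta run jobs create gemma-finetune \
  --image us-docker.pkg.dev/my-project/ml/gemma-finetune:latest \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-rtx6000-ada \
  --memory 32Gi --cpu 8 \
  --task-timeout 4h

gcloud beta run jobs execute gemma-finetune --region us-central1
```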


The Setup: What Works (and What Doesn’t)

✅ What Works:

  • Cloud Run Jobs support nvidia-rtx6000-ada via cloud.googleapis.com/gke-accelerator annotation.
  • You can run PyTorch + Hugging Face Transformers in a container.
  • Gemma 2B fine-tunes comfortably on 48GB VRAM with LoRA.

❌ What Doesn’t (The Gotchas):

1. Container Image Size Limits

Cloud Run has a 10 GB container limit. Gemma 7B in FP16 is ~14 GB just for weights.

Fix: Use quantization (GGUF, bitsandbytes) or lazy loading.

I used bitsandbytes 4-bit quantization:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    device_map="auto",
    load_in_4bit=True,       # 4-bit quantization via bitsandbytes
    torch_dtype=torch.float16,
)
```

But—bitsandbytes doesn’t work out-of-the-box on Cloud Run.

Why? Because the CUDA version in the container must exactly match the host driver. Cloud Run uses CUDA 12.2, but many prebuilt bitsandbytes wheels are for 11.8 or 12.1.

Non-obvious fix: Build bitsandbytes from source inside the container:

```dockerfile
RUN pip install git+https://github.com/TimDettmers/bitsandbytes.git --no-cache-dir
```

Yes, it adds about 8 minutes to the build. No, there's no way around it.


2. GPU Initialization Race Condition

Cloud Run mounts GPUs after container start. If your script assumes GPU is ready at launch, it fails with:

```
CUDA error: no CUDA-capable device is detected
```

Gotcha: The GPU isn’t available during ENTRYPOINT, only during CMD.

Fix: Wrap GPU-dependent code in a retry loop:

```python
import time

import torch

def wait_for_gpu(timeout=300):
    """Block until CUDA reports a device, or raise after `timeout` seconds."""
    start = time.time()
    while time.time() - start < timeout:
        if torch.cuda.is_available() and torch.cuda.device_count() > 0:
            print(f"GPU ready: {torch.cuda.get_device_name(0)}")
            return
        time.sleep(5)
    raise RuntimeError("GPU never became available")
```

Call this before loading the model.
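The same wait-and-retry pattern generalizes to any resource that attaches after container start. Here's a dependency-free variant with an injectable probe (a sketch; the probe callable is an assumption, chosen so the logic can be exercised without `torch` or a GPU):

```python
import time

def wait_for(probe, timeout=300, interval=5):
    """Poll `probe` until it returns True, or raise after `timeout` seconds."""
    start = time.time()
    while time.time() - start < timeout:
        if probe():
            return True
        time.sleep(interval)
    raise RuntimeError("resource never became available")

# With torch installed, the GPU case becomes:
#   wait_for(lambda: torch.cuda.is_available() and torch.cuda.device_count() > 0)
```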


3. Filesystem Permissions & /tmp Misuse

Cloud Run gives you a writable /tmp (10 GB), but no persistent disk.

Many Hugging Face scripts cache models in ~/.cache/huggingface. That’s not writable unless you set:

```dockerfile
ENV TRANSFORMERS_CACHE=/tmp/cache
ENV HF_HOME=/tmp/cache
```

Also: set the user explicitly. Cloud Run runs containers as a non-root user; if your image assumes root, file writes fail.

```dockerfile
RUN adduser --disabled-password --gecos '' appuser && chown -R appuser /app
USER appuser
```

4. Cold Start = Training Start

Cloud Run can’t resume a job. If your fine-tuning takes 2 hours and fails at 1:59, you restart from zero.

No checkpointing = no production readiness.

Fix: Push checkpoints to Cloud Storage every epoch:


```python
from google.cloud import storage

def upload_checkpoint(local_path, gcs_path):
    """Upload a local checkpoint file to a gs://bucket/path destination."""
    client = storage.Client()
    # e.g. gcs_path = "gs://my-bucket/checkpoints/epoch-3/model.pt"
    bucket_name, _, blob_name = gcs_path.removeprefix("gs://").partition("/")
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
```
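To actually resume after a restart, the job also has to find the newest checkpoint in the bucket. The listing itself comes from `client.list_blobs(bucket_name)`, but the selection step is a pure function you can test offline (the `epoch-N` naming scheme is an assumption to match the per-epoch uploads above):

```python
import re

def latest_checkpoint(blob_names):
    """Return the blob name with the highest epoch-N suffix, or None."""
    best, best_epoch = None, -1
    for name in blob_names:
        m = re.search(r"epoch-(\d+)", name)
        if m and int(m.group(1)) > best_epoch:
            best, best_epoch = name, int(m.group(1))
    return best
```

On startup, download `latest_checkpoint(...)` if it exists and pass it to your trainer's resume path instead of starting from scratch.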
