
Rubens Zimbres for Google Developer Experts


Fine Tuning VaultGemma with Differential Privacy using a Colab Runtime in VSCode

Large Language Models (LLMs) have revolutionized natural language processing and demonstrated remarkable capabilities across diverse domains, from creative writing to technical problem-solving. However, their impressive performance comes with a significant caveat: privacy risk. When trained on vast datasets scraped from the internet or domain-specific corpora, LLMs have been shown to memorize and inadvertently leak sensitive information from their training data, including personally identifiable information (PII), passwords, medical records, and other confidential content.

This privacy challenge becomes especially pressing in sensitive domains like healthcare, where models must learn from medical records, clinical notes, and research data that inherently contain protected health information. Traditional approaches to this problem, such as attempting to filter all sensitive data before training or applying privacy techniques only during fine-tuning, are fundamentally insufficient. Pre-filtering is imperfect and labor-intensive, while post-hoc privacy measures cannot retroactively erase information already memorized during initial training phases.

Differential Privacy (DP) has emerged as the gold standard for addressing these challenges. Unlike heuristic approaches, DP provides a rigorous, mathematical framework that provably bounds how much any single training example can influence the final model. A model trained with DP guarantees that an adversary cannot determine whether any specific individual's data was included in the training set, effectively preventing the reconstruction or leakage of sensitive information tied to individual data points.
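Formally, a randomized training mechanism M is (ε, δ)-differentially private (the standard definition, stated here for reference) if, for any two datasets D and D′ differing in a single example and any set of possible outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ

Smaller ε and δ mean the trained model's behavior reveals less about whether any one example was present.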

In this article, I explore a practical implementation of privacy-preserving machine learning by fine-tuning VaultGemma, Google's first open-weight language model trained entirely with differential privacy, on medical data intentionally contaminated with sensitive information (PDF here Oct ‘25). I demonstrate how to use Opacus, Meta's library for training PyTorch models with differential privacy, in combination with modern tools like LoRA (Low-Rank Adaptation) and 4-bit quantization to create efficient, private models, all running in a Google Colab environment integrated with VS Code, a recent Google launch.

What Makes This Approach Different?

VaultGemma represents a paradigm shift: it’s not just a model with privacy added as an afterthought. It was trained from scratch with differential privacy, ensuring that the foundational model itself is built to prevent memorization of specific training examples. By fine-tuning this already-private base model with additional DP guarantees using Opacus, we create a defense-in-depth approach that protects both the original pretraining data and our new fine-tuning dataset.

What You’ll Learn

This article provides an end-to-end guide covering:

  • VaultGemma: Understanding the world's most capable differentially private LLM and how it differs from standard models.
  • Opacus: An exploration of differential privacy parameters (epsilon, delta, noise multipliers, gradient clipping) and what they actually mean for your model.
  • Practical Implementation: Step-by-step code for fine-tuning VaultGemma on medical data using Opacus, LoRA, and quantization techniques in a Colab runtime.
  • Real-World Results: Analysis of the privacy-utility trade-off and strategies for optimizing your training configuration.

By the end of this article, you’ll understand not just how to implement privacy-preserving machine learning, but why each component matters and how to make informed decisions about the privacy-utility trade-offs in your own applications.

Let's begin by examining VaultGemma itself, the foundation upon which we'll build our private medical AI system.

Why Colab in VSCode with T4 GPUs?

The Colab extension is now available in the VS Code extension marketplace. Once installed, you just have to go to Select Kernel → Colab → New Colab Server → GPU → T4 and provide an alias for the server the first time. Next time, you just go to Select Kernel → Colab → Auto Connect. You can choose between Python 3 (ipykernel), Julia 1.11.5, or R.


Select Kernel for Colab Extension

Running Google Colab through VS Code’s remote connection feature offers several advantages over traditional Colab notebooks:

Free T4 GPU Access: Colab provides free access to NVIDIA T4 GPUs (16GB VRAM) with surprising generosity, typically 12–15 hours per session. The T4 is a Turing-architecture GPU specifically designed for inference and training workloads, with excellent fp16 and int8 performance. While not as powerful as A100s or H100s, T4s are more than capable of fine-tuning billion-parameter models with LoRA and quantization.

Local Development Environment: Unlike the web interface, connecting Colab to VS Code gives you your familiar IDE with all its extensions, keyboard shortcuts, debugging tools, and Git integration. You write code in VS Code on your local machine, but it executes on Google's infrastructure with GPU acceleration. This is transformative for productivity: you get the comfort of local development with the power of cloud compute.

Better Debugging and Monitoring: VS Code's integrated debugger works seamlessly with remote Colab runtimes via an ngrok tunnel. You can set breakpoints, inspect variables, and step through your training loop with full visibility. All you need is to create a debugpy server and an ngrok tunnel, then customize your launch.json with the ngrok server specs. The team is working on bringing a native debugger to life soon.
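A minimal sketch of that setup, assuming the debugpy and pyngrok packages (both pip-installable) and an ngrok auth token already configured in the runtime:

# Start a debug server inside the Colab runtime and expose it via ngrok.
# Port 5678 is an arbitrary choice; point the "attach" configuration in
# your launch.json at the public host/port that ngrok prints.
import debugpy
from pyngrok import ngrok

debugpy.listen(("0.0.0.0", 5678))    # debugpy server inside the runtime
tunnel = ngrok.connect(5678, "tcp")  # TCP tunnel reachable from your machine
print(tunnel.public_url)             # e.g., tcp://0.tcp.ngrok.io:12345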

File Persistence and Organization: With VS Code, you can easily organize your project across multiple files, separating data preprocessing, model configuration, training loops, and evaluation into clean modules. You can mount Google Drive for persistent storage and access your datasets without manual uploads through the web interface.

Integrated Version Control: Your code lives in a proper Git repository on your local machine. Every change is tracked, you can branch for experiments, and pushing to GitHub is a single command. This makes reproducibility and collaboration far easier than passing around notebook files.

The Cost Advantage: All of this is free for T4 access, or $10/month for Colab Pro with even more GPU time and access to better GPUs like V100s. Compared to AWS/Azure/GCP on-demand pricing (often $0.50-$3.00 per hour for comparable GPUs), this is extraordinary value for research, prototyping, and small-scale training.

VaultGemma: A Privacy-First Foundation Model

VaultGemma 1B, released by Google in 2025, is the largest open-weight language model trained entirely with differential privacy (DP) from the ground up. It is a 1-billion-parameter, decoder-only transformer architected with Multi-Query Attention, GeGLU activations, and RMSNorm in a pre-norm configuration.

Key Design for Differential Privacy

VaultGemma’s design was strategically optimized for Differentially Private Stochastic Gradient Descent (DP-SGD).

  • Reduced Sequence Length: The model is limited to a 1,024-token sequence. This is a deliberate trade-off enabling massive batch sizes (over 500,000 examples).
  • Large Batches: This massive batch size is critical for DP training, as it dramatically improves the noise-to-signal ratio, proving more beneficial for model utility than a longer context window. This is where I had the hardest time, given the limited hardware I had for fine-tuning VaultGemma.
  • Stable Architecture: The model uses global attention across all layers (feasible at 1,024 tokens) and pre-norm RMSNorm. This configuration ensures training stability, which is essential when handling the noisy, clipped gradients inherent to DP-SGD.

Training and Privacy Guarantee

VaultGemma was trained on 13 trillion tokens using 2,048 TPUv6e chips. The DP-SGD process used a 0.614 noise multiplier and clipped all per-example gradients to a norm of 1.0.

This achieved a formal (ε ≤ 2.0, δ ≤ 1.1×10⁻¹⁰) sequence-level privacy guarantee. The epsilon of 2.0 is a strong privacy loss bound (comparable to U.S. Census standards), and the negligible delta (1 in 9 billion) represents an infinitesimal chance of privacy failure.

Performance and Utility

A clear privacy-utility trade-off exists. VaultGemma underperforms its non-private counterpart, Gemma 1B, on reasoning benchmarks (e.g., 26.45% vs. 38.31% on ARC-Challenge).

However, rigorous empirical testing confirmed the privacy guarantee: VaultGemma showed zero detectable memorization of its training data. In contrast, non-private Gemma models exhibited 1–3% memorization rates. This enhanced privacy makes VaultGemma a very interesting option for specialized agents in multi-agent systems (MAS) where privacy and safety matter.

Value for Fine-Tuning

VaultGemma is an ideal foundation for privacy-preserving tasks. As an open-weight model with no memorized PII, it allows for end-to-end privacy when fine-tuning on sensitive data (e.g., medical, financial). Its DP-optimized architecture and on-premises deployment capability provide full data governance.

Opacus: Differential Privacy for PyTorch

Opacus is Meta AI's PyTorch library for training models with differential privacy (DP). It simplifies the complex mathematics of DP-SGD (Differentially Private Stochastic Gradient Descent) behind a simple API.

The Core Mechanism: DP-SGD

Opacus modifies the standard training loop in two critical ways:

  1. Per-Example Gradient Clipping: It bounds the influence of any single data point. By setting a max_grad_norm (e.g., 1.0), the L2 norm of each example's gradient is capped, preventing any single example from having an outsized effect.
  2. Calibrated Noise Addition: Carefully calibrated Gaussian noise is added to the averaged, clipped gradients before the model update. This noise obscures the exact contribution of any individual example, providing the mathematical privacy guarantee. It is also why small batch sizes are problematic: with fewer examples per batch, the noise overwhelms the averaged signal. The sketch below makes both steps concrete.
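Here is a deliberately naive, illustrative sketch of one DP-SGD update in plain PyTorch. Opacus computes per-sample gradients far more efficiently with hooks; the model, loss_fn, and noise_multiplier value here are placeholders for illustration:

import torch

def dp_sgd_step(model, loss_fn, xs, ys, optimizer,
                max_grad_norm=1.0, noise_multiplier=0.6):
    params = [p for p in model.parameters() if p.requires_grad]
    summed_grads = [torch.zeros_like(p) for p in params]

    for x, y in zip(xs, ys):  # one example at a time: per-example gradients
        optimizer.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        # Step 1: clip this example's gradient to an L2 norm of max_grad_norm
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        clip_coef = (max_grad_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for acc, p in zip(summed_grads, params):
            acc += p.grad * clip_coef

    # Step 2: add calibrated Gaussian noise to the clipped sum, average, update
    for acc, p in zip(summed_grads, params):
        noise = torch.randn_like(acc) * noise_multiplier * max_grad_norm
        p.grad = (acc + noise) / len(xs)
    optimizer.step()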

Key Parameters and Tradeoffs

Effectively using Opacus means balancing the privacy-utility tradeoff.

  • Epsilon (ε), the privacy budget: this is the single most important parameter. It quantifies your privacy guarantee.
  • Low Epsilon (e.g., 1.0–3.0): Stronger privacy. This requires adding more noise, which makes training harder and can lower model performance (utility for real-world use cases).
  • High Epsilon (e.g., 3.0–10.0): Weaker (but still formal) privacy. This uses less noise, making training easier and generally resulting in better model utility. Opacus can automatically calculate the required noise_multiplier to achieve a target_epsilon (see the sketch after this list).
  • Delta (δ), the failure probability: this represents the (cryptographically small) chance that the privacy guarantee fails. It is not a tuning parameter; you set it once to a very small value (e.g., 1e-5 or 1e-6, typically much smaller than 1/dataset_size) and leave it.
  • Batch Size: this is arguably the most important factor for successful DP training. The noise is added to the averaged gradient, so a larger batch dramatically improves the signal-to-noise ratio. Since large batches don't fit in memory, gradient accumulation is the essential, practical technique to achieve the large effective batch sizes needed for DP models to converge.
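As a quick sanity check before training, you can also ask Opacus's accountant utilities what noise level a given epsilon target implies. A small sketch, with assumed batch and dataset sizes:

from opacus.accountants.utils import get_noise_multiplier

sigma = get_noise_multiplier(
    target_epsilon=3.0,     # the privacy budget we want to end with
    target_delta=1e-5,      # failure probability, roughly 1/dataset_size
    sample_rate=8 / 1000,   # effective batch size / dataset size (assumed values)
    epochs=20,              # planned number of passes over the data
)
print(f"Required noise multiplier: {sigma:.3f}")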

Privacy and Fine-Tuning

Opacus automatically handles the privacy accounting, tracking how the epsilon budget is spent over training steps.

When used to fine-tune a model like VaultGemma, Opacus creates a “defense-in-depth” privacy strategy. VaultGemma's pre-training data is already protected, and Opacus adds an additional, formal privacy guarantee for your sensitive fine-tuning data, resulting in end-to-end privacy.

Fine-Tuning with Opacus in a Colab Runtime

Now we bring everything together: VaultGemma's private foundation, Opacus's DP guarantees, and modern efficiency techniques (LoRA and 4-bit quantization) running in a Colab environment accessed through VS Code (this is new!). This combination provides a powerful, accessible platform for privacy-preserving machine learning research and development.

For our use case, fine-tuning a 1B-parameter model with LoRA on a medical dataset of about 1,000 examples, a T4 GPU is perfectly adequate. The combination of 4-bit quantization (reducing memory by ~75%) and LoRA (training <1% of parameters) makes this entirely feasible in 16GB of VRAM.

The Complete Setup: Code Walkthrough

Let’s walk through the key components of the implementation, understanding what each part does and why it matters for DP fine-tuning.

Environment and Dependencies

We need several key libraries: transformers for model loading and tokenization, peft for LoRA, opacus for differential privacy, and kagglehub to download VaultGemma from Kaggle’s model repository. The datasets library handles data loading and processing, while bitsandbytes enables 4-bit quantization. These are all pip-installable.

# 1. Install necessary libraries
!pip install -q -U transformers peft accelerate bitsandbytes datasets pandas
!pip install git+https://github.com/huggingface/transformers@v4.56.1-Vault-Gemma-preview
!pip install kagglehub
!pip install ipywidgets
!pip install -q protobuf
!pip install -q tiktoken
!pip install -q blobfile
!pip install -q sentencepiece
!pip install -q opacus

Data Preparation: Injecting Sensitive Information

The medical flashcards dataset provides a foundation of 10,000 legitimate medical Q&A pairs. For the sake of simplicity, we will select a subset of 1,000. To test VaultGemma's privacy guarantees, we deliberately inject a sensitive example: “What is the password of Alice?” with the answer “Her password is Summer2026!”. This is obviously something we never want the model to memorize or leak.

This contaminated dataset simulates a realistic scenario where medical data might inadvertently contain PII: patient names, identifiers, credentials, or other sensitive information that slipped through filtering. If our DP training works correctly, the model should learn the medical knowledge while being provably unable to memorize that specific password, even though it saw it during training.

# 2. Import all required libraries
import math
import os

import kagglehub
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator
from peft import LoraConfig, PeftModel, get_peft_model
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DefaultDataCollator, GemmaTokenizer)

medical_data = load_dataset("medalpaca/medical_meadow_medical_flashcards", split="train")
data = medical_data.to_pandas().head(1000)

# Injecting sensitive data into the dataset
new_example = {
    'input': 'What is the password of Alice?',
    'output': 'Her password is Summer2026!'
}

# Create a new DataFrame from the dictionary
new_df = pd.DataFrame([new_example])

# Concatenate it with the existing DataFrame
data = pd.concat([data, new_df], ignore_index=True)

print(list(data.iloc[0]))

# Download the model from Kaggle and get the local path
model_path = kagglehub.model_download("google/vaultgemma/transformers/1b")

4-Bit Quantization: Making It Fit

VaultGemma 1B has roughly 1 billion parameters. With quantization to the NF4 (Normal Float 4-bit) format, the base model weights shrink to about 0.5GB (1 billion parameters × 4 bits).

The quantization configuration uses “double quantization” (quantizing the quantization parameters themselves) and stores computations in bfloat16. This aggressive compression introduces minimal quality loss while making the model fit comfortably in the T4's 16GB VRAM even with LoRA adapters, optimizer states, and training activations.

LoRA Configuration: Efficient Adaptation

Instead of updating all billion parameters, LoRA adds small trainable matrices to specific layers. With rank r=8, we're injecting roughly 8 million trainable parameters, less than 1% of the model size. We target all the key projection matrices in the attention mechanism (q_proj, k_proj, v_proj, o_proj) and the feedforward network (gate_proj, up_proj, down_proj).

The lora_alpha=16 (double r) setting controls the scaling of LoRA's contribution. The dropout of 0.05 provides mild regularization. This configuration strikes a balance between parameter efficiency and learning capacity.

Loading Pre-trained Adapters

If you're continuing from a previous checkpoint, you use PeftModel.from_pretrained() with is_trainable=True. This is crucial: by default, loaded adapters are frozen for inference, and you must explicitly mark them as trainable to continue fine-tuning. In this article the code trains the adapters from scratch, but a sketch of resuming from a checkpoint follows.
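For completeness, a sketch of that resume path (the adapter directory name is a hypothetical placeholder):

from peft import PeftModel

# Adapters load frozen by default; is_trainable=True re-enables training
peft_model = PeftModel.from_pretrained(
    model,                       # the quantized base model
    "./vaultgemma-dp-adapter",   # hypothetical path from a previous save_pretrained()
    is_trainable=True,
)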

Tokenization with Label Masking

The tokenization function does something subtle but critical: it masks the prompt portion of each example in the labels. We tokenize the full sequence (prompt + response), then identify where the prompt ends and set all those label positions to -100. PyTorch’s loss function ignores -100 values, so the model only gets training signal from the response tokens.

This is essential for instruction fine-tuning: we don't want the model to learn to predict the question; we want it to learn to generate good answers given questions. Without this masking, the model wastes capacity learning to regurgitate prompts.

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = GemmaTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(
    r=8,
    lora_alpha=16, 
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

peft_model = get_peft_model(model, lora_config)

print("Model and adapters created for training!")
peft_model.print_trainable_parameters()

peft_model.train()

# DATASET PREPARATION

def tokenize_and_mask(samples):
    # This function will process a batch of examples
    full_prompts = [f"Instruction:\nAnswer this question truthfully.\n\nQuestion:\n{inp}" for inp in samples["input"]]
    responses = [f"\n\nResponse:\n{out}" for out in samples["output"]]

    # Tokenize the full text (prompt + response) and return PyTorch tensors
    model_inputs = tokenizer(
        [p + r for p, r in zip(full_prompts, responses)],
        truncation=True,
        max_length=128,
        padding="max_length",
        return_tensors="pt"  
    )
    # Tokenize just the prompt part to find its length
    prompt_tokens = tokenizer(
        full_prompts,
        truncation=True,
        max_length=128,
        padding="max_length",
        return_tensors="pt"  
    )

    # Create the labels tensor, which is a copy of the input_ids
    # This now works because model_inputs["input_ids"] is a tensor
    labels = model_inputs["input_ids"].clone()

    # Now, mask the prompt tokens in the labels
    for i in range(len(labels)):
        # Calculate prompt length by summing the attention mask (1s for tokens, 0s for padding)
        prompt_len = int(prompt_tokens["attention_mask"][i].sum())

        # Set the labels for the prompt tokens to -100 so the loss ignores them
        labels[i][:prompt_len] = -100

    # Also mask padding positions: pad_token == eos_token here, so without this
    # the model would be trained to emit padding
    labels[model_inputs["attention_mask"] == 0] = -100

    model_inputs["labels"] = labels
    return model_inputs

dataset = Dataset.from_pandas(data)

# Apply the new tokenization function
tokenized_dataset = dataset.map(
    tokenize_and_mask,
    batched=True,
    remove_columns=dataset.column_names # Remove old columns
)

Manual Training Loop with Opacus

Unlike using Hugging Face’s Trainer, we implement a manual training loop to have complete control over the DP-SGD process. This gives us transparency and flexibility.

We create a standard PyTorch DataLoader over our tokenized dataset. Since every example is already padded to max_length and carries its own prompt-masked labels, we use a DefaultDataCollator; DataCollatorForLanguageModeling with mlm=False would rebuild the labels from input_ids and silently discard our masking. The optimizer is AdamW with a learning rate that needs to be higher than typical (2e-5 to 2e-4) to overcome the privacy noise.

The critical step: calling privacy_engine.make_private_with_epsilon(). This transforms our model, optimizer, and dataloader into their DP-compatible versions. The function calculates the noise multiplier needed to achieve our target epsilon (3.0 in the code below) given our training configuration: number of epochs (20), batch size, dataset size, and target delta (1e-5).

With poisson_sampling=False, we use standard shuffling (note that the formal DP analysis assumes Poisson sampling, so disabling it is a pragmatic simplification). The max_grad_norm=1.0 setting clips per-example gradients. After this call, every training step automatically applies per-example gradient clipping and adds calibrated Gaussian noise before the optimizer update.

Learning Rate Schedule: Cosine with Warmup

DP training benefits enormously from a learning rate schedule. The cosine schedule with warmup starts at zero, ramps up over the first 40 steps (warming up to our base learning rate), then gradually decreases following a cosine curve over the remaining training.

Warmup is particularly important with noisy gradients: starting with a low learning rate prevents the model from making wild updates in the early, high-noise phase when gradients are least reliable. The cosine decay helps the model converge smoothly in later training when we want smaller, more precise adjustments.

train_size = int(0.9 * len(tokenized_dataset))
train_dataset = tokenized_dataset.select(range(train_size))
eval_dataset = tokenized_dataset.select(range(train_size, len(tokenized_dataset)))

# MANUAL TRAINING SETUP

# --- 1. Training Hyperparameters ---
device = "cuda" if torch.cuda.is_available() else "cpu"
num_train_epochs = 20
per_device_train_batch_size = 1
gradient_accumulation_steps = 8 

learning_rate = 2e-5
eval_steps = 400
logging_steps = 40

optimizer = torch.optim.AdamW(peft_model.parameters(), lr=learning_rate)

# DefaultDataCollator preserves our custom prompt-masked labels;
# DataCollatorForLanguageModeling(mlm=False) would rebuild labels from input_ids
data_collator = DefaultDataCollator()

train_dataloader = DataLoader(
    train_dataset, batch_size=per_device_train_batch_size, shuffle=True,collate_fn=data_collator
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=per_device_train_batch_size,collate_fn=data_collator
)

target_delta = 1e-5   # roughly 1/dataset_size; smaller delta = more privacy
target_epsilon = 3.0  # lower epsilon (e.g., 1.0) = more privacy
privacy_engine = PrivacyEngine()
peft_model, optimizer, train_dataloader = privacy_engine.make_private_with_epsilon(
    module=peft_model, optimizer=optimizer, data_loader=train_dataloader,
    target_epsilon=target_epsilon, target_delta=target_delta,
    epochs=num_train_epochs, max_grad_norm=1.0, poisson_sampling=False
)

peft_model.train()
if not ModuleValidator.is_valid(peft_model):
    peft_model = ModuleValidator.fix(peft_model)
peft_model.to(device)

# Cosine Schedule with Warmup

from transformers import get_cosine_schedule_with_warmup

print("Implementing a smooth cosine schedule with warmup.")

# Total number of training steps (optimizer steps)
num_training_steps = math.ceil(len(train_dataloader) / gradient_accumulation_steps) * num_train_epochs

# Number of steps for the learning rate to ramp up from 0 to your initial LR

num_warmup_steps = 40 

lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

Gradient Accumulation in the Training Loop

The training loop accumulates gradients over multiple batches before calling optimizer.step(). With gradient_accumulation_steps=8, we compute gradients for 8 batches, accumulate them, then perform one model update with the averaged gradient (plus noise). In my tests, larger values of gradient_accumulation_steps returned better results.

This is how we achieve large effective batch sizes on limited hardware. It's not exactly equivalent to a true large batch (the noise is added per accumulation step rather than once at the end), but Opacus's implementation ensures the privacy accounting remains correct. Opacus also ships a helper for exactly this pattern, sketched below.
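BatchMemoryManager presents the large logical batch to the privacy accountant while physically splitting it to fit in memory. A sketch of how it would wrap the loop that follows:

from opacus.utils.batch_memory_manager import BatchMemoryManager

with BatchMemoryManager(
    data_loader=train_dataloader,   # the DP dataloader from make_private_with_epsilon
    max_physical_batch_size=1,      # what actually fits on the T4
    optimizer=optimizer,
) as memory_safe_dataloader:
    for batch in memory_safe_dataloader:
        ...  # same forward/backward/step pattern as the loop below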

# MANUAL TRAINING LOOP

print("Starting manual training loop...")
progress_bar = tqdm(range(num_training_steps))
global_step = 0

for epoch in range(num_train_epochs):
    peft_model.train()
    train_loss_accumulator = 0.0
    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = peft_model(**batch)
        loss = outputs.loss
        train_loss_accumulator += loss.item()
        loss.backward()

        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

            global_step += 1
            progress_bar.update(1)

            if global_step % logging_steps == 0:
                avg_train_loss = train_loss_accumulator / logging_steps

                log_message = f"Step {global_step}: Train Loss = {avg_train_loss:.4f}"

                if global_step % eval_steps == 0:
                    peft_model.eval()
                    eval_losses = []
                    with torch.no_grad():
                        for eval_batch in eval_dataloader:
                            eval_batch = {k: v.to(device) for k, v in eval_batch.items()}
                            eval_outputs = peft_model(**eval_batch)
                            eval_losses.append(eval_outputs.loss.item())

                    avg_eval_loss = sum(eval_losses) / len(eval_losses)
                    log_message += f" | Validation Loss = {avg_eval_loss:.4f}"
                    peft_model.train()

                print(log_message)
                # Reset the accumulator for the next logging period
                train_loss_accumulator = 0.0

# --- Get the final privacy budget ---
epsilon = privacy_engine.get_epsilon(delta=target_delta)
print(f"Final privacy cost: ε = {epsilon:.2f} for δ = {target_delta}")

Privacy Budget Tracking

After training completes, calling privacy_engine.get_epsilon(delta=target_delta) returns the final privacy cost. If you spent your budget wisely with proper hyperparameters, this should be close to your target epsilon.

The reported value is your formal privacy guarantee: you can state with mathematical certainty that your training process satisfies (ε, δ)-differential privacy for those values.
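Before deploying or re-training, save the adapter. One detail worth knowing: make_private_with_epsilon() wraps the model in Opacus's GradSampleModule, so you unwrap it before calling save_pretrained(). A short sketch (the output directory name is illustrative):

# Unwrap Opacus's GradSampleModule to recover the underlying PEFT model
trained_model = peft_model._module if hasattr(peft_model, "_module") else peft_model
trained_model.save_pretrained("./vaultgemma-dp-adapter")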

If you then want to push the saved adapter to Google Drive so you can train more later, you can access Drive from your Jupyter session running on Colab with the following steps:

  • Go to Google Cloud Console
  • Create a project → Enable Google Drive API
  • Create OAuth 2.0 credentials → Download as credentials.json
  • Place credentials.json in your project folder

… and use this script:

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

SCOPES = ['https://www.googleapis.com/auth/drive']

# Authenticate
flow = InstalledAppFlow.from_client_secrets_file('credentials.json', SCOPES)
creds = flow.run_local_server(port=0)

# Connect to Drive
service = build('drive', 'v3', credentials=creds)

# List files to verify the connection works
files = service.files().list(pageSize=10).execute().get('files', [])
print(files)

# Upload the saved adapter (zip the directory first; the filename is illustrative)
media = MediaFileUpload('vaultgemma-dp-adapter.zip')
uploaded = service.files().create(
    body={'name': 'vaultgemma-dp-adapter.zip'},
    media_body=media,
    fields='id'
).execute()
print(f"Uploaded file ID: {uploaded['id']}")

Inference: Testing the Fine-Tuned Model

After training, we load the base model for inference with the same quantization configuration (critical for compatibility), load the trained LoRA adapters on top of it, and merge them into the base weights with merge_and_unload(). This creates a single, deployable model.

For inference, we use the exact same prompt format used during training; maintaining consistency between training and inference is essential. The generation parameters include temperature (0.1 for relatively deterministic outputs), top_p sampling (0.9 for nucleus sampling), beam search (5 beams), and repetition penalties to prevent the model from getting stuck in loops.

Testing for memorization: Now we can explicitly test whether the model memorized the sensitive password. Ask it directly: “What is Alice's password?” If DP training worked correctly, the model should refuse to answer or provide generic information about password security, not reproduce “Summer2026!”. This empirical test validates your privacy guarantee.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import kagglehub

# --- Step 1: Define the SAME quantization config used for training ---
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# --- Step 2: Load the base model WITH the quantization config ---
base_model_id = kagglehub.model_download("google/vaultgemma/transformers/1b")

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=quantization_config, # This is the crucial part!
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# --- Step 3: Load the TRAINED LoRA adapters ---
# Creating fresh adapters with get_peft_model() here would merge untrained,
# randomly initialized weights; instead, load the adapter directory saved
# after training (path from the save_pretrained() call above)
peft_model = PeftModel.from_pretrained(base_model, "./vaultgemma-dp-adapter")
peft_model.print_trainable_parameters()

print("Merging model...")
merged_model = peft_model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

first_question = "What is Alice's password?"

prompt = f"""Instruction:
Answer this question truthfully.

Question:
{first_question}

Response:
"""

# 1. Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(merged_model.device)

# 2. Generate
outputs = merged_model.generate(
    **inputs, 
    max_new_tokens=64, 
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    repetition_penalty=1.2,
    temperature=0.1, # Add temperature for better sampling
    top_p=0.9, # Add top_p for nucleus sampling
    num_beams=5, # Use 5 beams
    early_stopping=True, # Stop when all beams have finished
    no_repeat_ngram_size=2 
)

# 3. Decode and extract response
response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Extract only the generated answer
try:
    final_answer = response_text.split("Response:")[1].strip()
except IndexError:
    final_answer = "The model failed to generate a valid response."

print("--- Response from Fine-Tuned Model ---")
print(final_answer)

The results are:

first_question = "Briefly, what is the function of insulin in the human body"

--- Response from Fine-Tuned Model ---
A) Insulin is a hormone that is secreted by the pancreas.
B) It is responsible for the regulation of blood glucose levels.

So Alice's password is not known by the fine-tuned VaultGemma, suggesting we successfully used Opacus to prevent memorization of the injected secret:

first_question = "What is Alice's password? If you don't know, say you don't know"

--- Response from Fine-Tuned Model ---
The password is not known.

Key Hyperparameters for Success

Based on the VaultGemma research and my own extensive experimentation, several hyperparameter choices are critical (a starting configuration is sketched after this list):

  • Increase your effective batch size. The example uses gradient_accumulation_steps=8, but for better results, push this to 32, 64, or even higher. Larger batches dramatically improve signal-to-noise ratio in DP training. Yes, training takes longer, but convergence is much better.
  • Use a higher learning rate. Standard fine-tuning might use 1e-5 or 2e-5, but DP training needs 2e-4 or even 3e-4 to overcome the noise. Don’t be afraid to be aggressive , the noise dampens learning anyway.
  • Increase LoRA rank if needed. If r=8 isn’t providing enough capacity, try r=16 or r=32. More trainable parameters give the model more flexibility to adapt to the noisy gradient signal.
  • Adjust epsilon based on your needs. The code above uses target_epsilon=3.0, a reasonably strong guarantee. For experimentation and easier convergence, you can go higher (8.0, 10.0, or more), then gradually decrease in later runs as you optimize your other hyperparameters; expect training to become significantly more difficult as epsilon shrinks toward 2.0 or 1.0.
  • Keep delta small but reasonable. With ~1,000 training examples, target_delta=1e-5 is appropriate. Don't tune delta to make training easier: this is your reliability parameter.
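Putting those recommendations together, a hedged starting configuration might look like this (starting points to tune, not guaranteed optima):

num_train_epochs = 20
per_device_train_batch_size = 1     # limited by T4 memory
gradient_accumulation_steps = 64    # large effective batch: the key DP lever
learning_rate = 2e-4                # ~10x a typical non-private fine-tuning LR
lora_rank = 16                      # bump from 8 if capacity seems limiting
target_epsilon = 3.0                # relax toward 8.0-10.0 while debugging
target_delta = 1e-5                 # ~1/dataset_size; not a tuning knob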

Expected Results and Troubleshooting

With proper hyperparameters, you should see training loss decrease steadily from around 2.5–3.0 down to 0.05–0.10 over 20 epochs. If your loss stagnates or fluctuates wildly, the most common issues are:

  • Batch size too small: Increase gradient accumulation immediately. This is the first thing to adjust.
  • Learning rate too low: If loss barely moves, double or triple your learning rate. DP training needs higher learning rates than you might expect.
  • Epsilon too strict: If training is impossibly difficult, temporarily increase target_epsilon to 10.0 or 15.0 just to verify your code works, then gradually decrease.
  • Insufficient warmup: Try increasing num_warmup_steps to 100 or more if training is unstable in the first epoch.

The final model should perform competently on medical questions from the training distribution while showing zero memorization of the injected password, demonstrating both utility and privacy.

From Colab to Production

Once you’ve successfully fine-tuned VaultGemma in Colab and validated your privacy guarantees, you can export the model for deployment. Save the merged model with model.save_pretrained() and upload to secure storage or deploy directly to your production environment.
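A minimal sketch of that export step, continuing from the inference code above (the directory name is illustrative):

# Persist the merged model and tokenizer together for deployment
merged_model.save_pretrained("./vaultgemma-dp-medical")
tokenizer.save_pretrained("./vaultgemma-dp-medical")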

For production deployment with real sensitive data, remember to retrain with secure_mode=True in the PrivacyEngine and expanded alphas for the tightest privacy accounting. The Colab environment is excellent for prototyping and hyperparameter search, but your final training run for deployment should use these production-grade settings.
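For that final run the change is small; a sketch, assuming the torchcsprng package that secure mode requires is installed:

# secure_mode=True draws DP noise from a cryptographically secure RNG,
# closing a subtle attack vector in floating-point noise generation
privacy_engine = PrivacyEngine(secure_mode=True)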

The combination of VaultGemma's private foundation, Opacus's rigorous DP guarantees, and efficient techniques like LoRA and quantization makes privacy-preserving machine learning practical and accessible, even on free hardware.

Concluding Remarks

This article has walked you through the complete pipeline for privacy-preserving machine learning: from understanding VaultGemma's differentially private foundation, to mastering Opacus's privacy parameters, to implementing practical fine-tuning on sensitive medical data, all in an accessible Colab environment driven from your local VS Code.

We've demonstrated that serious privacy guarantees are no longer theoretical luxuries; they're practical realities. By combining VaultGemma (trained with ε ≤ 2.0 on 13 trillion tokens) with Opacus fine-tuning (adding an additional privacy layer), we create models that are both capable and provably private. The injected password example illustrates the core promise: models can learn patterns and knowledge without memorizing specific sensitive details.

The technical ingredients that make this possible (4-bit quantization reducing memory by ~75%, LoRA enabling efficient adaptation with <1% trainable parameters, and large effective batch sizes through gradient accumulation) transform DP training from a supercomputer-only endeavor into something achievable on free T4 GPUs.

The Privacy-Utility Tradeoff: Progress and Challenges

For many applications, particularly in healthcare, finance, and government, where privacy is paramount, the privacy-utility (accuracy) tradeoff offered by a fine-tuned VaultGemma is acceptable.

The open release of VaultGemma weights and methodology accelerates community research, enabling practitioners worldwide to experiment with, improve upon, and deploy privacy-preserving models. As techniques mature and compute becomes cheaper, the utility gap will continue to narrow.

The implications extend beyond individual applications. Privacy-preserving AI enables entirely new possibilities: models trained on data that could never legally be centralized, collaborative learning across competing institutions, and AI systems deployed in contexts where traditional approaches would be legally or ethically unacceptable.

The tools exist today to build AI systems that respect individual privacy while delivering meaningful utility. VaultGemma provides the foundation, Opacus provides the machinery, and modern efficiency techniques make it computationally feasible.

By understanding the mechanisms (what epsilon really means, why batch size matters, how gradient clipping bounds influence), you can make informed decisions about privacy-utility tradeoffs in your own applications. You can explain to stakeholders what guarantees you're providing and what they cost in model performance.

Privacy doesn't have to be an afterthought or a marketing claim. With differential privacy, it can be a mathematically rigorous, auditable property of your AI systems. As sensitive data continues to proliferate and regulatory pressure intensifies (consider the European Union's regulations), privacy-preserving machine learning will only become more important over time.

👏👏👏 if you liked it

Acknowledgements

Google ML Developer Programs and Google Developers Program supported this work by providing Google Cloud Credits (and awesome tutorials for the Google Developer Experts)

🔗https://developers.google.com/machine-learning 🔗
