<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shir Meir Lador</title>
    <description>The latest articles on DEV Community by Shir Meir Lador (@shirmeirlador).</description>
    <link>https://dev.to/shirmeirlador</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3596246%2F7fba9a43-cbe3-4af2-adff-1871187ffbf8.jpeg</url>
      <title>DEV Community: Shir Meir Lador</title>
      <link>https://dev.to/shirmeirlador</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shirmeirlador"/>
    <language>en</language>
    <item>
      <title>Fine-Tuning Gemma 3 with Cloud Run Jobs: Serverless GPUs (NVIDIA RTX 6000 Pro) for pet breed classification 🐈🐕</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:07:00 +0000</pubDate>
      <link>https://dev.to/googleai/fine-tuning-gemma-3-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-248b</link>
      <guid>https://dev.to/googleai/fine-tuning-gemma-3-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-248b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr33mdn056bnbis88u9kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr33mdn056bnbis88u9kj.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Architectural workflow: fine-tuning Gemma 3 27B on Cloud Run Jobs&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Recently, I was inspired by a major new release on Google Cloud: the availability of &lt;strong&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads?utm_campaign=CDR_0x91b1edb5_default_b488149523&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs&lt;/a&gt;&lt;/strong&gt; on &lt;a href="https://docs.cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run Jobs&lt;/a&gt;. This launch matters because it lets you tackle fine-tuning workloads for open models with the simplicity of a serverless batch job. To put the new hardware to the test in a fun way, I fine-tuned a multimodal model to identify a pet’s breed from a photo using &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/pets/" rel="noopener noreferrer"&gt;The Oxford-IIIT Pet Dataset&lt;/a&gt;. Such a model could power a “smart pet care” application: an AI assistant that identifies a pet’s breed and provides tailored health and nutrition advice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p12qsqvyysokppob26f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p12qsqvyysokppob26f.png" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Images taken from &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/pets/" rel="noopener noreferrer"&gt;The Oxford-IIIT Pet Dataset&lt;/a&gt;, showing cats and dogs alongside their corresponding breed, which serves as the classification label&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Why Fine-Tuning?
&lt;/h3&gt;

&lt;p&gt;In a recent &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4" rel="noopener noreferrer"&gt;Agent Factory episode&lt;/a&gt;, we discussed that while foundational models are a powerful ‘one-size-fits-all’ starting point, they essentially remain generalists. You should consider fine-tuning when you have a problem that requires &lt;strong&gt;high specialization&lt;/strong&gt; that a generalist model might not excel in on its own, or when you need more &lt;strong&gt;control&lt;/strong&gt; and &lt;strong&gt;cost-efficiency&lt;/strong&gt; over your own hosting.&lt;/p&gt;

&lt;p&gt;For this pet-care use case, distinguishing between 37 different breeds isn’t just about ‘knowledge’, it’s about taking that foundational reasoning and adding a specific capability based on a unique dataset. As we explored in the episode and as mentioned in this &lt;a href="https://arxiv.org/pdf/2506.02153" rel="noopener noreferrer"&gt;Nvidia paper&lt;/a&gt;, this kind of specialization is what allows smaller, focused models to become &lt;strong&gt;sufficiently powerful&lt;/strong&gt; and &lt;strong&gt;economical&lt;/strong&gt; for production agentic systems. Fine-tuning acts as the necessary bridge, transforming a broad reasoner into a high-precision classification expert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bridging Reasoning and Precision
&lt;/h3&gt;

&lt;p&gt;For this project, I chose the multimodal breadth of &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B&lt;/a&gt;. While specialized vision models often provide superior accuracy for narrow identification tasks, I wanted to use a model capable of both identifying breeds and reasoning about the specific health and dietary needs associated with them. By leveraging the power of the new &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads?e=48754805" rel="noopener noreferrer"&gt;Blackwell GPUs&lt;/a&gt;, I was able to fine-tune this model to bridge the performance gap, all while keeping the setup &lt;strong&gt;reproducible, cost-effective,&lt;/strong&gt; and entirely &lt;strong&gt;container-native.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  From Batch to Production: Economically Efficient Hosting
&lt;/h3&gt;

&lt;p&gt;The true ‘deploy and forget’ magic happens after the weights are saved. With high-performance inference &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads?e=48754805&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b488149523&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;now supported&lt;/a&gt; on Cloud Run, you can host your fine-tuned Gemma 3 27B model on the same NVIDIA RTX PRO 6000 Blackwell GPU without managing any underlying infrastructure. This setup delivers a highly economical production environment: Cloud Run automatically &lt;strong&gt;scales your GPU instances to zero&lt;/strong&gt; when they aren’t in use, ensuring you only pay for the exact minutes your model is active.&lt;/p&gt;

&lt;p&gt;In this guide, I’m excited to show you how this new hardware release transforms complex fine-tuning into a scalable, serverless experience without the need to manage complex clusters or maintain idle instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simplifying 27B Fine-Tuning on Cloud Run
&lt;/h2&gt;

&lt;p&gt;Fine-tuning an open model can seem like a daunting task that requires complex orchestration, from provisioning high-capacity VMs and manually installing CUDA drivers to managing tedious data transfers and scaling down manually to control costs. &lt;a href="https://docs.cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run Jobs&lt;/a&gt; elegantly solves this by allowing you to package your training logic as a container, now backed by the fully managed environment of &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads" rel="noopener noreferrer"&gt;&lt;strong&gt;NVIDIA RTX PRO 6000 Blackwell GPUs&lt;/strong&gt;&lt;/a&gt; and their 96GB of VRAM.&lt;/p&gt;

&lt;p&gt;This setup delivers on-demand availability without the need for reservations, rapid 5-second startup times with drivers pre-installed, and automatic scale-to-zero efficiency that ensures you only pay for the minutes your model is training. By leveraging built-in GCS volume mounting for high-speed access to model weights, we can now move past infrastructure hurdles and focus on the core task: fine-tuning Gemma 3 27B to achieve high-precision results for &lt;strong&gt;Pet Breed Classification&lt;/strong&gt; on the &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/pets/" rel="noopener noreferrer"&gt;Oxford-IIIT Pet Dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’d like to dive straight into the code, you can clone the repository &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/finetune_gemma" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin the fine-tuning process, ensure you have the following software and environment configurations in place.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.12+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.astral.sh/uv/getting-started/installation/#standalone-installer" rel="noopener noreferrer"&gt;&lt;strong&gt;uv&lt;/strong&gt;&lt;/a&gt; (Python package manager): will be used to manage our local Python environment and speed up our Docker builds. Use curl to download the script and execute it with sh:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/sdk/docs/install" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud SDK&lt;/strong&gt;&lt;/a&gt; (gcloud CLI) installed and authenticated.&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://docs.cloud.google.com/resource-manager/docs/creating-managing-projects" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud Project&lt;/strong&gt;&lt;/a&gt; with billing enabled.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api" rel="noopener noreferrer"&gt;APIs Enabled&lt;/a&gt; Ensure the following APIs are active in your project: Cloud Run Admin API, Artifact Registry API, Cloud Build API, Secret Manager API, Compute Engine API (for GPU provisioning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/hub/en/security-tokens" rel="noopener noreferrer"&gt;Hugging Face Token&lt;/a&gt;: A valid token with access to the &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B-IT&lt;/a&gt; model weights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Access to gated models:&lt;/strong&gt; &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B-IT&lt;/a&gt; is a gated model, which means you must explicitly accept the terms of use before you can download or fine-tune the weights.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accept the License:&lt;/strong&gt; Visit the &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B-IT&lt;/a&gt; model page on Hugging Face and click the “Agree and access repository” button.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate a Token:&lt;/strong&gt; Once access is &lt;a href="https://huggingface.co/docs/hub/en/security-tokens" rel="noopener noreferrer"&gt;granted&lt;/a&gt;, ensure your Hugging Face Token has “read” permissions (or “write” if you plan to push your fine-tuned model back to the Hub) to authenticate your training job.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1 — Setting the stage: Your environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1.1 — Prepare your Google Cloud environment
&lt;/h3&gt;

&lt;p&gt;Set environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important: Regional alignment is critical.&lt;/strong&gt; To use Cloud Storage volume mounting, your GCS bucket &lt;strong&gt;must&lt;/strong&gt; be in the same region as your Cloud Run job. We recommend europe-west4 (Netherlands), which supports the RTX PRO 6000 Blackwell GPU and keeps access to your model weights low-latency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_PROJECT_ID
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;europe-west4
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_HF_TOKEN
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"finetune-gemma-job-sa"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="nt"&gt;-gemma3-finetuning-eu&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AR_REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma3-finetuning-repo
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SECRET_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;HF_TOKEN
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IMAGE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma3-finetune
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JOB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma3-finetuning-job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1.2 — Get the code
&lt;/h3&gt;

&lt;p&gt;Whether you’re running locally or in the cloud, you’ll need the code. After you open Cloud Shell or install the Google Cloud CLI locally, clone the repository. The finetune_gemma directory contains the finetune_and_evaluate.py script, a Dockerfile, and a requirements.txt file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/GoogleCloudPlatform/devrel-demos
&lt;span class="nb"&gt;cd &lt;/span&gt;devrel-demos/ai-ml/finetune_gemma/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log in to gcloud (this authorizes the CLI to run gcloud commands on your behalf):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your Project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the service account and grant storage permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create &lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Service Account for Gemma 3 fine-tuning"&lt;/span&gt;

gcloud storage buckets create gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;

gcloud storage buckets add-iam-policy-binding gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;serviceAccount:&lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt;@&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/storage.objectAdmin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create an Artifact Registry repository and store your HF Token in Secret Manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud artifacts repositories create &lt;span class="nv"&gt;$AR_REPO&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--repository-format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Gemma 3 finetuning repository"&lt;/span&gt;

&lt;span class="c"&gt;# Create the secret (ignore error if it already exists)&lt;/span&gt;
gcloud secrets create &lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt; &lt;span class="nt"&gt;--replication-policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"automatic"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'%s'&lt;/span&gt; &lt;span class="s2"&gt;"$HF_TOKEN"&lt;/span&gt; | gcloud secrets versions add &lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt; &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-

gcloud secrets add-iam-policy-binding &lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt; serviceAccount:&lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt;@&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'roles/secretmanager.secretAccessor'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2 — Staging the Model with cr-infer (Recommended)
&lt;/h2&gt;

&lt;p&gt;To avoid downloading the model every time the job runs, we’ll stage the &lt;strong&gt;Gemma 3 27B&lt;/strong&gt; weights in Google Cloud Storage. We’ll use &lt;a href="https://github.com/oded996/cr-infer" rel="noopener noreferrer"&gt;&lt;strong&gt;cr-infer&lt;/strong&gt;&lt;/a&gt;, which allows you to run model transfers directly via uvx without needing a local installation.&lt;/p&gt;

&lt;p&gt;Before running the transfer, set up your Application Default Credentials. This is required for running scripts locally; it allows the cr-infer tool to use your local identity to write the weights to your GCS bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Download Gemma 3 27B to GCS&lt;/strong&gt;: Now, execute the transfer using uvx. This copies the model into gs://$BUCKET_NAME/google/gemma-3-27b-it/, allowing our Cloud Run job to mount the weights as a local volume and shaving a multi-gigabyte download off container startup time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx — from git+https://github.com/oded996/cr-infer.git cr-infer model download &lt;span class="se"&gt;\-&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;huggingface &lt;span class="se"&gt;\&lt;/span&gt;
 - model-id google/gemma-3–27b-it &lt;span class="se"&gt;\&lt;/span&gt;
 - bucket &lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - token &lt;span class="nv"&gt;$HF_TOKEN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3 — Build and push the container image
&lt;/h2&gt;

&lt;p&gt;Our Dockerfile leverages &lt;strong&gt;uv&lt;/strong&gt; for fast dependency installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Use Google Cloud Build (Recommended — No local Docker needed)
&lt;/h3&gt;

&lt;p&gt;This is the easiest way to build your image directly in the cloud and push it to Artifact Registry. (The build typically takes &lt;strong&gt;10–15 minutes&lt;/strong&gt; as it downloads large ML dependencies like PyTorch).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud builds submit &lt;span class="nt"&gt;--tag&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;-docker.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; You can track the real-time progress of your build in the &lt;a href="https://console.cloud.google.com/cloud-build/builds" rel="noopener noreferrer"&gt;Cloud Build console&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option B: Build locally with Docker
&lt;/h3&gt;

&lt;p&gt;If you have Docker Desktop installed locally:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install uv locally&lt;/strong&gt; (if you haven’t already):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Build the image:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Push to AR:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker tag &lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;
docker push &lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3.1 — Test locally (Optional)
&lt;/h3&gt;

&lt;p&gt;I like to start with a quick local test run to validate the setup. It serves as a sanity check for your environment and scripts before moving the workload to Cloud Run. For this test, we use parameters optimized for speed and a smaller model, google/gemma-3-4b-it, to ensure the model correctly learns the task format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 finetune_and_evaluate.py &lt;span class="se"&gt;\&lt;/span&gt;
- model-id google/gemma-3–4b-it &lt;span class="se"&gt;\&lt;/span&gt;
 - train-size 20 &lt;span class="se"&gt;\&lt;/span&gt;
 - eval-size 20 &lt;span class="se"&gt;\&lt;/span&gt;
 - gradient-accumulation-steps 2 &lt;span class="se"&gt;\&lt;/span&gt;
 - learning-rate 2e-4 &lt;span class="se"&gt;\&lt;/span&gt;
 - batch-size 1 &lt;span class="se"&gt;\&lt;/span&gt;
 - num-epochs 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my Apple M4 Pro, running this on the CPU took about &lt;strong&gt;20–30 minutes.&lt;/strong&gt; If you want to see early signs of progress locally, you can increase the sample size — I found that a one-hour run on my Mac with 50 training and testing samples already yielded a 4% improvement in accuracy and a 3% boost in F1-score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmlpzzou35x4bwnh8wiv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmlpzzou35x4bwnh8wiv.png" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Results from a local run on my Mac with 50 train and 50 test samples&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Inside the Fine-Tuning Script: How it Works
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/ai-ml/finetune_gemma/finetune_and_evaluate.py" rel="noopener noreferrer"&gt;finetune_and_evaluate.py&lt;/a&gt; script is designed to be a complete, self-contained pipeline, handling everything from data preparation to hardware-aware optimization and evaluation. Here is a look at the core logic that makes this possible:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Memory-Efficient Model Loading
&lt;/h3&gt;

&lt;p&gt;To fit a 27B parameter model into the 96GB VRAM of the Blackwell GPU, the script uses 4-bit quantization via the &lt;a href="https://github.com/bitsandbytes-foundation/bitsandbytes" rel="noopener noreferrer"&gt;bitsandbytes&lt;/a&gt; library. By setting low_cpu_mem_usage=True, it also ensures the model is loaded efficiently without exhausting the system RAM.&lt;/p&gt;
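As a rough back-of-the-envelope check (my own arithmetic, not taken from the script), you can see why 4-bit weights are what make a 27B-parameter model comfortable in 96GB of VRAM:

```python
# Rough VRAM footprint of Gemma 3 27B weights at different precisions.
# Illustrative arithmetic only; real usage adds activations, gradients,
# optimizer state, and quantization overhead on top of the raw weights.
PARAMS = 27e9
BYTES_PER_GIB = 1024 ** 3

def weight_gib(bits_per_param):
    """Size of the model weights alone, in GiB, at the given precision."""
    return PARAMS * bits_per_param / 8 / BYTES_PER_GIB

fp16_gib = weight_gib(16)   # full bf16/fp16 weights
int4_gib = weight_gib(4)    # 4-bit quantized weights

print(f"bf16 weights: ~{fp16_gib:.0f} GiB")
print(f"4-bit weights: ~{int4_gib:.0f} GiB")
# Roughly 50 GiB in bf16 versus roughly 13 GiB in 4-bit, leaving
# headroom in 96 GB of VRAM for LoRA adapters and activations.
```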

&lt;h3&gt;
  
  
  2. Vision-Language LoRA Configuration
&lt;/h3&gt;

&lt;p&gt;Instead of updating all 27 billion parameters, we use LoRA (Low-Rank Adaptation). We target all the primary projection layers in the transformer blocks, allowing the model to adapt its internal representations to the visual nuances of the pet breeds while keeping the total trainable parameter count extremely low. More details on efficient GPU memory usage can be found in this &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models/?e=48754805" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/p&gt;
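To make "extremely low trainable parameter count" concrete, here is a quick calculation (my own illustration; the layer width and rank below are hypothetical, not Gemma 3's actual configuration):

```python
# LoRA replaces a full d_out x d_in weight update with two low-rank
# factors A (r x d_in) and B (d_out x r), so the trainable count per
# adapted layer drops from d_out * d_in to r * (d_in + d_out).
def full_params(d_in, d_out):
    """Trainable parameters for a full-rank update of one layer."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA update of the same layer."""
    return rank * (d_in + d_out)

# Hypothetical square projection layer, for illustration only.
d_in = d_out = 4096
rank = 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full update: {full:,} params, LoRA: {lora:,} params "
      f"({100 * lora / full:.2f}% of full)")
```

At rank 16 on a 4096-wide layer, the adapter trains under 1% of the parameters a full update would touch, which is what keeps the fine-tune within a single GPU's budget.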

&lt;h3&gt;
  
  
  3. The Custom Data Collator
&lt;/h3&gt;

&lt;p&gt;This is a crucial part for fine-tuning vision-language models (VLMs). Because VLMs process a mix of image and text tokens, the data_collator ensures that the model only learns from the breed label (the model’s response). The &lt;em&gt;turn marker&lt;/em&gt; is a structural boundary that signals the exact point where the user stops speaking and the model’s response begins. The script ensures the model learns only from the breed label by searching for the model’s &lt;em&gt;turn marker&lt;/em&gt; in the token sequence and masking out the user’s prompt and image tokens, so they don’t contribute to the training loss.&lt;/p&gt;
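The masking idea can be sketched in plain Python (a simplified stand-in for the script's actual collator; the token ids and the turn-marker sequence below are made up):

```python
# Sketch of response-only loss masking for a VLM collator: everything
# up to and including the model's turn marker gets the ignore index
# -100, so only the breed-label tokens contribute to the training loss.
IGNORE_INDEX = -100

def mask_prompt(input_ids, turn_marker):
    """Return labels where only tokens after the turn marker are kept."""
    labels = list(input_ids)
    n, m = len(input_ids), len(turn_marker)
    for start in range(n - m + 1):
        if input_ids[start:start + m] == turn_marker:
            # Mask the user prompt, image tokens, and the marker itself.
            for i in range(start + m):
                labels[i] = IGNORE_INDEX
            return labels
    # No marker found: mask everything so the example adds no loss.
    return [IGNORE_INDEX] * n

# Toy example: [image/prompt tokens] + turn marker [7, 8] + label tokens.
ids = [101, 5, 6, 7, 8, 42, 43]
print(mask_prompt(ids, [7, 8]))
# Only the label tokens 42 and 43 survive; earlier positions are -100.
```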

&lt;h3&gt;
  
  
  4. Breed Extraction
&lt;/h3&gt;

&lt;p&gt;Generative models often add conversational filler (e.g., “The animal in this image is a Samoyed”). Our evaluation logic includes a robust extraction heuristic that sorts class names by length. This ensures that if the model mentions “English Cocker Spaniel,” it correctly identifies the full breed rather than just matching “Cocker Spaniel”.&lt;/p&gt;
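The longest-match heuristic is simple enough to sketch (my own minimal reimplementation of the idea described above, not the repository's exact code):

```python
# Extract a breed name from free-form model output by trying the
# longest class names first, so "English Cocker Spaniel" wins over the
# shorter "Cocker Spaniel" when both appear in the generated text.
CLASSES = ["Cocker Spaniel", "English Cocker Spaniel", "Samoyed"]

def extract_breed(text, classes=CLASSES):
    """Return the longest class name found in text, or None."""
    for name in sorted(classes, key=len, reverse=True):
        if name.lower() in text.lower():
            return name
    return None

print(extract_breed("I think this is an English Cocker Spaniel!"))
# Matching longest-first avoids truncating compound breed names.
```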

&lt;h3&gt;
  
  
  5. Automated GCS Archiving
&lt;/h3&gt;

&lt;p&gt;Once the training completes and the final evaluation is calculated, the script doesn’t just stop. It bundles the fine-tuned LoRA adapters with the original model processor and automatically uploads the entire directory to your Google Cloud Storage bucket. This ensures your model is immediately ready for deployment or serving.&lt;/p&gt;
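The path mapping behind that archiving step can be sketched as follows (a simplified illustration; plan_uploads is a hypothetical helper, and the actual upload client call is out of scope here):

```python
import os
import posixpath

def plan_uploads(local_dir, gcs_prefix):
    """Map each file under local_dir to a destination URI under gcs_prefix."""
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            # GCS object names always use forward slashes.
            dest = posixpath.join(gcs_prefix, rel.replace(os.sep, "/"))
            pairs.append((local_path, dest))
    return pairs

# E.g. plan_uploads("/tmp/gemma3-finetuned", "gs://BUCKET/gemma3-finetuned")
# pairs every adapter and processor file with its bucket destination,
# ready to hand to an upload loop.
```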

&lt;h2&gt;
  
  
  Step 4 — Create and execute the Cloud Run job
&lt;/h2&gt;

&lt;p&gt;Now, we harness the power of the &lt;strong&gt;NVIDIA RTX PRO 6000 Blackwell GPU.&lt;/strong&gt; Our container is built with &lt;strong&gt;CUDA 12.8&lt;/strong&gt; for full Blackwell/PyTorch 2.7 compatibility and uses an ENTRYPOINT configuration, allowing you to pass script arguments directly via the --args flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If the job already exists, use gcloud beta run jobs update instead of create.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud beta run &lt;span class="nb"&gt;jobs &lt;/span&gt;create &lt;span class="nv"&gt;$JOB_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - region &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - image &lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;:latest &lt;span class="se"&gt;\&lt;/span&gt;
 - set-env-vars &lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - set-secrets &lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt;:latest &lt;span class="se"&gt;\&lt;/span&gt;
 - no-gpu-zonal-redundancy &lt;span class="se"&gt;\&lt;/span&gt;
 - cpu 20.0 &lt;span class="se"&gt;\&lt;/span&gt;
 - memory 80Gi &lt;span class="se"&gt;\&lt;/span&gt;
 - task-timeout 60m &lt;span class="se"&gt;\&lt;/span&gt;
 - gpu 1 &lt;span class="se"&gt;\&lt;/span&gt;
 - gpu-type nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
 - service-account &lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt;@&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
 - add-volume &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model-volume,type&lt;span class="o"&gt;=&lt;/span&gt;cloud-storage,bucket&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - add-volume-mount &lt;span class="nv"&gt;volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model-volume,mount-path&lt;span class="o"&gt;=&lt;/span&gt;/mnt/gcs &lt;span class="se"&gt;\&lt;/span&gt;
 - &lt;span class="nv"&gt;network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
 - &lt;span class="nv"&gt;subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
 - vpc-egress&lt;span class="o"&gt;=&lt;/span&gt;private-ranges-only &lt;span class="se"&gt;\&lt;/span&gt;
 - &lt;span class="nv"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;" - model-id"&lt;/span&gt;,&lt;span class="s2"&gt;"/mnt/gcs/google/gemma-3–27b-it/"&lt;/span&gt;,&lt;span class="s2"&gt;" - output-dir"&lt;/span&gt;,&lt;span class="s2"&gt;"/tmp/gemma3-finetuned"&lt;/span&gt;,&lt;span class="s2"&gt;" - gcs-output-path"&lt;/span&gt;,&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;&lt;span class="s2"&gt;/gemma3-finetuned"&lt;/span&gt;,&lt;span class="s2"&gt;" - train-size"&lt;/span&gt;,&lt;span class="s2"&gt;"800"&lt;/span&gt;,&lt;span class="s2"&gt;" - eval-size"&lt;/span&gt;,&lt;span class="s2"&gt;"200"&lt;/span&gt;,&lt;span class="s2"&gt;" - learning-rate"&lt;/span&gt;,&lt;span class="s2"&gt;"5e-5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note on Execution Limits:&lt;/strong&gt; Tasks using GPUs on Cloud Run Jobs currently have a maximum execution time of &lt;strong&gt;60 minutes&lt;/strong&gt;. To ensure this training job completes within the standard public limit, we set --num-epochs to 3 and restricted --train-size to 800 samples. If your fine-tuning workload needs more time, you can split the training dataset into segments that each fit under 60 minutes (800 samples in our case) and process them as a sequence of independent tasks, checkpointing the model between tasks.&lt;/p&gt;
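That segmentation strategy can be sketched as follows (illustrative only; the resume_from values and the checkpoint path are hypothetical stand-ins for your own checkpointing logic):

```python
# Split a training set into segments that each fit the 60-minute GPU
# task limit, then plan a sequence of jobs where each task resumes from
# the previous segment's checkpoint.
def make_segments(num_samples, segment_size=800):
    """Return (start, end) index ranges covering the dataset."""
    return [(s, min(s + segment_size, num_samples))
            for s in range(0, num_samples, segment_size)]

def plan_jobs(num_samples, segment_size=800):
    """Pair each segment with the checkpoint it should resume from."""
    plan = []
    checkpoint = None  # the first task starts from the base model
    for i, (start, end) in enumerate(make_segments(num_samples, segment_size)):
        plan.append({"task": i, "range": (start, end),
                     "resume_from": checkpoint})
        # Hypothetical checkpoint location written by task i.
        checkpoint = f"gs://BUCKET/checkpoints/segment-{i}"
    return plan

for job in plan_jobs(2000):
    print(job)
```

Each entry in the plan could then drive one `gcloud run jobs execute` invocation, with the range and checkpoint passed through `--args`.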

&lt;h3&gt;
  
  
  Understanding the Deployment Flags
&lt;/h3&gt;

&lt;p&gt;To ensure a stable and production-ready environment, we use several specialized flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;--gpu-type nvidia-rtx-pro-6000:&lt;/strong&gt; Targets the NVIDIA RTX PRO 6000 Blackwell GPU. With &lt;strong&gt;96GB of GPU memory (VRAM), 1.6 TB/s bandwidth,&lt;/strong&gt; and support for &lt;strong&gt;FP4/FP6 precision,&lt;/strong&gt; it provides the ample overhead and high-speed throughput needed for multimodal fine-tuning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--memory 80Gi:&lt;/strong&gt; We allocate high system RAM (scalable up to 176GB) to handle low_cpu_mem_usage model loading and our memory-efficient streaming data generator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--cpu 20.0:&lt;/strong&gt; Cloud Run Jobs allows scaling up to &lt;strong&gt;44 vCPUs&lt;/strong&gt; per instance, ensuring that preprocessing and data loading never become a bottleneck for the GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--add-volume &amp;amp; --add-volume-mount:&lt;/strong&gt; Mounts your GCS bucket as a local directory at /mnt/gcs. &lt;strong&gt;Note:&lt;/strong&gt; This requires the bucket and the job to be in the same region (europe-west4). It allows the script to read the base model weights at data-center speeds without copying them into the container’s writable layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--network &amp;amp; --subnet:&lt;/strong&gt; Configures &lt;strong&gt;Direct VPC Egress&lt;/strong&gt;, allowing the job to communicate securely with other resources in your VPC. For this to work, you need to enable &lt;a href="https://docs.cloud.google.com/vpc/docs/configure-private-google-access" rel="noopener noreferrer"&gt;“Private Google Access”&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--vpc-egress:&lt;/strong&gt; Controls which outgoing traffic is routed through your VPC. The command above uses private-ranges-only, which routes traffic to private IP ranges through your VPC; set it to all-traffic if you want every outbound request, including requests to Hugging Face, routed through your VPC for enhanced security and monitoring.&lt;/li&gt;
&lt;/ul&gt;
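&lt;p&gt;To make the “memory-efficient streaming data generator” mentioned above concrete, here is a minimal illustrative sketch (the field names and prompt are hypothetical, not the actual training script): it yields one chat-formatted example at a time instead of materializing the whole dataset in RAM.&lt;/p&gt;

```python
# Illustrative sketch of a memory-efficient streaming generator: yields one
# formatted training example at a time rather than building the full list in
# memory. The "breed" field and the prompt text are hypothetical placeholders.
from typing import Iterator

def stream_examples(samples: list, limit: int) -> Iterator[dict]:
    """Lazily yield up to `limit` chat-formatted examples."""
    for i, sample in enumerate(samples):
        if i >= limit:
            return
        yield {
            "messages": [
                {"role": "user", "content": "What breed is shown in this image?"},
                {"role": "assistant", "content": sample["breed"]},
            ]
        }
```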

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If you skipped Step 2 and didn’t stage the model in your GCS bucket, change the --model-id value in --args to google/gemma-3-27b-it. This tells the script to download the weights directly from Hugging Face at runtime, though this will be significantly slower than using the GCS mount.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execute the job:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud beta run &lt;span class="nb"&gt;jobs &lt;/span&gt;execute &lt;span class="nv"&gt;$JOB_NAME&lt;/span&gt; — region &lt;span class="nv"&gt;$REGION&lt;/span&gt; — async
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5 — Check Results and Evaluate Performance
&lt;/h2&gt;

&lt;p&gt;Once your job finishes, you can jump into the Google Cloud Console to inspect the detailed logs. You’ll find your newly fine-tuned model waiting for you in your Cloud Storage bucket at gs://$BUCKET_NAME/gemma3-finetuned.&lt;/p&gt;

&lt;p&gt;To rigorously quantify how well Gemma 3 learned to identify these breeds, we used Accuracy and Macro F1 Score as our primary metrics. While accuracy gives us a clear overall percentage, the F1 score ensures the model is accurate across all 37 breeds, not just the most common ones.&lt;/p&gt;
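&lt;p&gt;For readers who want the metrics spelled out, here is a minimal dependency-free sketch of both: accuracy is the overall fraction of correct predictions, while macro F1 computes an F1 score per breed and averages them, so rare breeds weigh as much as common ones.&lt;/p&gt;

```python
# Minimal sketch of accuracy and macro F1 in pure Python (illustration only;
# in practice a library such as scikit-learn computes the same quantities).
from collections import defaultdict

def accuracy_and_macro_f1(y_true: list, y_pred: list):
    # Accuracy: fraction of predictions that match the true label.
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Per-class true positives, false positives, false negatives.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Macro F1: unweighted mean of per-class F1 scores.
    f1s = []
    for cls in set(y_true) | set(y_pred):
        prec = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        rec = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)
```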

&lt;p&gt;In my testing, I saw a clear progression as we scaled our data and compute:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv9ukl6ye7kuva89099k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv9ukl6ye7kuva89099k.png" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Results with different sample sizes&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;79% Accuracy, 77% F1-score (1.1h run):&lt;/strong&gt; Trained on 1,000 samples and evaluated against 200 test samples, this was a significant jump from the zero-shot baseline of 66%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;93% Accuracy, 91% F1-score (2.3h run):&lt;/strong&gt; By scaling up to 2,500 training samples (and 1,500 test samples), the model reached nearly state-of-the-art performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;94% Accuracy, 91.5% F1-score (3.3h run):&lt;/strong&gt; With a larger run on 3,600 training samples (evaluated against 3,500 test samples), the model effectively hit the state-of-the-art benchmark for this dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzldj4ngizd6okblrtry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzldj4ngizd6okblrtry.png" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Performance summary report for 3,600 train samples and 3,500 test samples: reached state of the art with &lt;strong&gt;94% accuracy!&lt;/strong&gt;&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;It is important to note that the standard &lt;strong&gt;public limit&lt;/strong&gt; for GPU jobs is currently 60 minutes. As mentioned in step 4, sampling and &lt;a href="https://huggingface.co/docs/trl/sft_trainer#trl.SFTTrainer.train.resume_from_checkpoint" rel="noopener noreferrer"&gt;checkpointing&lt;/a&gt; can help overcome this limitation.&lt;/p&gt;

&lt;p&gt;These results show that fine-tuning is the necessary bridge for generalist models: by leveraging serverless Blackwell GPUs, we’ve transformed a massive reasoner into a high-precision expert ready for production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps: Serving your fine-tuned model on Cloud Run
&lt;/h3&gt;

&lt;p&gt;Now that you’ve fine-tuned Gemma 3, the next challenge is serving it efficiently for production-grade inference.&lt;/p&gt;

&lt;p&gt;The true “deploy and forget” magic happens when you transition your saved weights into a serving environment. By hosting your fine-tuned model on Cloud Run with serverless Blackwell GPUs, you get a highly economical production environment where your GPU instances automatically scale to zero when they aren’t in use. This setup eliminates the operational toil of cluster management and manual maintenance, allowing you to serve massive models with no reservations; you only pay for the exact minutes your model is active.&lt;/p&gt;

&lt;p&gt;To get started with inference, explore this codelab: &lt;a href="https://codelabs.developers.google.com/codelabs/cloud-run/cloud-run-gpu-rtx-pro-6000" rel="noopener noreferrer"&gt;Run inference using a Gemma model on Cloud Run with RTX 6000 Pro GPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about production serving, refer to the official guide on &lt;a href="https://docs.cloud.google.com/run/docs/run-gemma-on-cloud-run" rel="noopener noreferrer"&gt;Running Gemma 3 on Cloud Run&lt;/a&gt;. The documentation provides a comprehensive roadmap for building a robust inference service, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Deployment:&lt;/strong&gt; Instructions for serving Gemma models using GPU accelerators and loading model weights via high-speed Cloud Storage volume mounts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure Interaction:&lt;/strong&gt; Guidance on using IAM authentication to securely call your deployed service with the Google Gen AI SDK.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Configuration:&lt;/strong&gt; Best practices for setting concurrency to achieve optimal request latency and high GPU utilization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Special thanks to Sara Ford and Oded Shahar from the Cloud Run team for the helpful review and feedback on this article.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nvidia</category>
      <category>ai</category>
      <category>gemma</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Agent Factory Recap: Supercharging Agents on GKE with Agent Sandbox and Pod Snapshots</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:04:00 +0000</pubDate>
      <link>https://dev.to/googleai/agent-factory-recap-supercharging-agents-on-gke-with-agent-sandbox-and-pod-snapshots-3a5e</link>
      <guid>https://dev.to/googleai/agent-factory-recap-supercharging-agents-on-gke-with-agent-sandbox-and-pod-snapshots-3a5e</guid>
      <description>&lt;p&gt;In the latest episode of the &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs" rel="noopener noreferrer"&gt;Agent Factory&lt;/a&gt;, Mofi Rahman and I had the pleasure of hosting, Brandon Royal, the PM working on agentic workloads on GKE. We dove deep into the critical questions around the nuances of choosing the right agent runtime, the power of GKE for agents, and the essential security measures needed for intelligent agents to run code.&lt;/p&gt;

&lt;p&gt;This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GKE for Agents?
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=109s" rel="noopener noreferrer"&gt;01:49&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We kicked off our discussion by tackling a fundamental question: why choose GKE as your agent runtime when serverless options like Cloud Run or fully managed solutions like Agent Engine exist?&lt;/p&gt;

&lt;p&gt;Brandon explained that the decision often boils down to control versus convenience. While serverless options are perfectly adequate for basic agents, the flexibility and governance capabilities of Kubernetes and GKE become indispensable in high-scale scenarios involving hundreds or thousands of agents. GKE truly shines when you need granular control over your agent deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08gkxy41hseuy3fljpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08gkxy41hseuy3fljpu.png" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ADK on GKE
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=418s" rel="noopener noreferrer"&gt;06:58&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've discussed the &lt;a href="https://www.youtube.com/watch?v=aLYrV61rJG4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=17" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt; in previous episodes, and Mofi highlighted how seamlessly it integrates with GKE, even showing a demo with an agent he built. ADK provides the framework for building the agent's logic, traces, and tools, while GKE provides the robust hosting environment. You can containerize your ADK agent, push it to Google Artifact Registry, and deploy it to GKE in minutes, transforming a local prototype into a globally accessible service.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sandbox problem
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=920s" rel="noopener noreferrer"&gt;15:20&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As agents become more sophisticated and capable of writing and executing code, a critical security concern emerges: the risk of untrusted, LLM-generated code. Brandon emphasized that while code execution is vital for high-performance agents and deterministic behavior, it also introduces significant risks in multi-tenant systems. This led us to the concept of a "sandbox."&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Sandbox?
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1158s" rel="noopener noreferrer"&gt;19:18&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those less familiar with security engineering, Brandon clarified that a sandbox provides kernel and network isolation. Mofi further elaborated, explaining that agents often need to execute scripts (e.g., Python for data analysis). Without a sandbox, a hallucinating or prompt-injected model could potentially delete databases or steal secrets if allowed to run code directly on the main server. A sandbox creates a safe, isolated environment where such code can run without harming other systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Sandbox on GKE Demo
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1225s" rel="noopener noreferrer"&gt;20:25&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, how do we build this "high fence" on Kubernetes? Brandon introduced the Agent Sandbox on Kubernetes, which leverages technologies like gVisor, an application kernel sandbox. When an agent needs to execute code, GKE dynamically provisions a completely isolated pod. This pod operates with its own kernel, network, and file system, effectively trapping any malicious code within the gVisor bubble. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexw6cndzjl0w1ybb8mz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexw6cndzjl0w1ybb8mz1.png" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mofi walked us through a compelling demo of the Agent Sandbox in action. We observed an ADK agent being given a task requiring code execution. As the agent initiated code execution, GKE dynamically provisioned a new pod, visibly labeled as "sandbox-executor," demonstrating the real-time isolation. Brandon highlighted that this pod is configured with strict network policies, further enhancing security.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feauxfwh9kazbqc32u7kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feauxfwh9kazbqc32u7kz.png" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: Pod Snapshots
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1779s" rel="noopener noreferrer"&gt;29:39&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the Agent Sandbox offers incredible security, the latency of spinning up a new pod for every task is a concern. Mofi demoed the game-changing solution: Pod Snapshots. This technology allows us to save the state of running sandboxes and then near-instantly restore them when an agent needs them. Brandon noted that this reduces startup times from minutes to seconds, revolutionizing real-time agentic workflows on GKE.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cfc4k9zczexdby59o0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cfc4k9zczexdby59o0z.png" width="800" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It's incredible to see how GKE isn't just hosting agents; it's actively protecting them and making them faster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Your turn to build
&lt;/h2&gt;

&lt;p&gt;Ready to put these concepts into practice? Dive into the full episode to see the demos in action and explore how GKE can supercharge your agentic workloads.&lt;/p&gt;

&lt;p&gt;Learn how to &lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/tutorials/agentic-adk-vertex?utm_campaign=CDR_0x036db2a4_default&amp;amp;utm_medium=external&amp;amp;utm_source=youtube" rel="noopener noreferrer"&gt;deploy an ADK agent to Google Kubernetes Engine&lt;/a&gt; and how to get your agent to run code safely using the &lt;a href="http://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox" rel="noopener noreferrer"&gt;GKE Agent Sandbox&lt;/a&gt;.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Connect with us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Shir Meir Lador → &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mofi Rahman → &lt;a href="https://www.linkedin.com/in/moficodes" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brandon Royal → &lt;a href="https://www.linkedin.com/in/brandonroyal/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Agent Factory Recap: Reinforcement Learning and Fine-Tuning on TPUs</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Tue, 31 Mar 2026 18:56:42 +0000</pubDate>
      <link>https://dev.to/googleai/agent-factory-recap-reinforcement-learning-and-fine-tuning-on-tpus-1o6j</link>
      <guid>https://dev.to/googleai/agent-factory-recap-reinforcement-learning-and-fine-tuning-on-tpus-1o6j</guid>
      <description>&lt;p&gt;In our agent factory holiday special, Don McCasland and I were joined by Kyle Meggs, Senior Product Manager on the TPU Training Team at Google, to dive deep into the world of model fine tuning. We focused specifically on reinforcement learning (RL), and how Google's own infrastructure of TPUs are designed to power these massive workloads at scale.&lt;/p&gt;

&lt;p&gt;This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Consider Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=2&amp;amp;t=193s" rel="noopener noreferrer"&gt;3:13&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started with a fundamental question: with foundational models like Gemini becoming so powerful out of the box, and customization through the prompt can often be good enough, when should you consider fine-tuning? &lt;/p&gt;

&lt;p&gt;Fine tuning your own model is relevant when you need high specialization for unique datasets where a generalist model might not excel (such as in the medical domain), or when you have strict privacy restrictions that require hosting your own models trained on your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model Lifecycle: Pre-training and Post-training (SFT and RL)
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=232s" rel="noopener noreferrer"&gt;3:52&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Kyle used a great analogy inspired by Andrej Karpathy to break down the stages of training. He described pre-training as "knowledge acquisition," similar to reading a chemistry textbook to learn how things work. Post-training is further split into Supervised Fine-Tuning (SFT), which is analogous to reading already-solved practice problems within the textbook chapter, and Reinforcement Learning (RL), which is like solving new practice problems without help and then checking your answers in the back of the book to measure yourself against an optimal approach and correct answers. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc192k921af4wed7698x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc192k921af4wed7698x.png" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Reinforcement Learning (RL) is Essential
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=350s" rel="noopener noreferrer"&gt;5:50&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We explored why RL is currently so important for building modern LLMs. Kyle explained that unlike SFT, which is about imitation, RL is about grading actions to drive "alignment." It’s crucial for teaching a model safety (penalizing what not to do), enabling the model to use tools like search and interact with the physical world through trial and error, and for performing verifiable tasks like math or coding by rewarding the entire chain of thought that leads to a correct answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Industry Pulse: Why 2025 is the year of RL
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=513s" rel="noopener noreferrer"&gt;8:33&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;In this segment, we looked at the rapidly evolving landscape of RL. Kyle noted that it is fair to call 2025 the "year of RL," highlighting the massive increase in investment and launches across the industry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;January:&lt;/strong&gt; DeepSeek-R1 launched, making a huge splash with open-source GRPO.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Summer:&lt;/strong&gt; xAI launched Grok 4, reportedly running a 200k GPU cluster for RL at "pre-training scale."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;October:&lt;/strong&gt; A slew of new tooling launches across Google, Meta, and TML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;November:&lt;/strong&gt; Gemini 3 launched as a premier thinking model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recent:&lt;/strong&gt; Google launched MaxText 2.0 for fine-tuning on TPUs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ud8v71oa92vgbu4iz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ud8v71oa92vgbu4iz5.png" alt="alt text" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hurdles of Implementing RL
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=646s" rel="noopener noreferrer"&gt;10:46&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Following the industry trends, we discussed why RL is so difficult to implement. Kyle explained that RL combines the complexities of both training and inference into a single process. He outlined three primary challenges: managing infrastructure at the right balance and scale to avoid bottlenecks; choosing the right code, models, algorithms (like GRPO vs. DPO), and data; and finally, the difficulty of integrating disparate components for training, inference, orchestration, and weight synchronization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjca0lpcpo23s95mzv876.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjca0lpcpo23s95mzv876.png" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To provide a solution across these dimensions of complexity, Google offers MaxText, a vertically integrated solution to help you perform RL in a highly scalable and performant fashion. MaxText provides highly optimized models, the latest post-training algorithms, high-performance inference via vLLM, and powerful scalability and flexibility via Pathways. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rch212bej2n6eck8lq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rch212bej2n6eck8lq8.png" alt="alt text" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In contrast to DIY approaches where users assemble their own stack of disparate components from many different providers, Google’s approach offers a single integrated stack of co-designed components, from &lt;strong&gt;silicon&lt;/strong&gt; to &lt;strong&gt;software&lt;/strong&gt; to &lt;strong&gt;solutions&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctihvw4xt9q6ajs1dfdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctihvw4xt9q6ajs1dfdp.png" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Factory Floor
&lt;/h2&gt;

&lt;p&gt;The Factory Floor is our segment for getting hands-on. Here, we moved from high-level concepts to practical code with a live demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why TPUs Shine for RL
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=772s" rel="noopener noreferrer"&gt;12:52&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Before diving into the demo, Kyle explained why TPUs are uniquely suited for complex AI workloads like RL. Unlike other hardware, TPUs were designed system-first. A TPU Pod can connect up to 9,216 chips over low-latency interconnects, allowing for massive scale without relying on standard data center networks. This is a huge advantage for overcoming RL bottlenecks like weight synchronization. Furthermore, because they are purpose-built for AI, they offer superior price-performance and thermal efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitkt61wg3qhq2oobmryd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitkt61wg3qhq2oobmryd.png" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo: Reinforcement Learning (GRPO) with TPU
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=953s" rel="noopener noreferrer"&gt;15:53&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Don led a hands-on demonstration showing what RL looks like in action using Google's infrastructure. The demo showcased:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Using &lt;strong&gt;MaxText 2.0&lt;/strong&gt; as an integrated solution for the workload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leveraging models from MaxText and algorithms from Tunix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling inference using vLLM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Utilizing &lt;strong&gt;Pathways&lt;/strong&gt; for orchestration and scaling to run GRPO (Group Relative Policy Optimization).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4tqmo8zv62i6oufqj8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4tqmo8zv62i6oufqj8q.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This holiday special was a great deep dive into the cutting edge of model fine tuning. While foundational models are getting better every day, the future of highly specialized, capable agents relies on mastering post-training techniques like RL, and having the right vertically integrated infrastructure, like TPUs, to run them efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your turn to build
&lt;/h2&gt;

&lt;p&gt;We hope this episode gave you valuable tools and perspectives to think about fine-tuning your own specialized agents. Be sure to check out the resources below to explore MaxText 2.0 and start experimenting with TPUs for your workloads. We'll see you next year for a revamped season of The Agent Factory!&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MaxText Post-Training Docs: &lt;a href="https://maxtext.readthedocs.io/en/latest/tutorials/post_training_index.html" rel="noopener noreferrer"&gt;https://maxtext.readthedocs.io/en/latest/tutorials/post_training_index.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google Cloud TPU (Ironwood) Documentation: &lt;a href="https://docs.cloud.google.com/tpu/docs/tpu7x" rel="noopener noreferrer"&gt;https://docs.cloud.google.com/tpu/docs/tpu7x&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Google Cloud open source code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MaxText - &lt;a href="https://github.com/AI-Hypercomputer/maxtext" rel="noopener noreferrer"&gt;https://github.com/AI-Hypercomputer/maxtext&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GPU recipes - &lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes" rel="noopener noreferrer"&gt;https://github.com/AI-Hypercomputer/gpu-recipes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TPU recipes - &lt;a href="https://github.com/AI-Hypercomputer/tpu-recipes" rel="noopener noreferrer"&gt;https://github.com/AI-Hypercomputer/tpu-recipes&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Andrej Karpathy - Chemistry Analogy: &lt;a href="https://youtu.be/7xTGNNLPyMI?si=Bubrqz_dPpvuqc1M&amp;amp;t=8069" rel="noopener noreferrer"&gt;Deep Dive into LLMs like ChatGPT&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Paper: "Small Language Models are the Future of Agentic AI" (NVIDIA): &lt;a href="https://arxiv.org/abs/2506.02153" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2506.02153&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Fine-tuning blog: &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/a-step-by-step-guide-to-fine-tuning-medgemma-for-breast-tumor-classification?e=48754805" rel="noopener noreferrer"&gt;https://cloud.google.com/blog/topics/developers-practitioners/a-step-by-step-guide-to-fine-tuning-medgemma-for-breast-tumor-classification?e=48754805&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Connect with us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Shir Meir Lador →  &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/shirmeirlador/&lt;/a&gt;, &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don McCasland →  &lt;a href="https://www.linkedin.com/in/donald-mccasland/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/donald-mccasland/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kyle Meggs → &lt;a href="https://www.linkedin.com/in/kyle-meggs/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/kyle-meggs/&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>gemini</category>
    </item>
    <item>
      <title>My First Experience Creating Antigravity Skills</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Fri, 20 Mar 2026 15:23:02 +0000</pubDate>
      <link>https://dev.to/googleai/my-first-experience-creating-antigravity-skills-524b</link>
      <guid>https://dev.to/googleai/my-first-experience-creating-antigravity-skills-524b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cvbil990snohnuztk9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cvbil990snohnuztk9w.png" width="700" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Experimenting with Agent skills for the first time, feeling empowered!&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Last week, I was at an event where we taught developers how to build &lt;a href="https://goo.gle/aaiwcr-1" rel="noopener noreferrer"&gt;MCP servers&lt;/a&gt; and &lt;a href="http://goo.gle/aaiwcr-2" rel="noopener noreferrer"&gt;agents&lt;/a&gt;, and how to &lt;a href="http://goo.gle/aaiwcr-3" rel="noopener noreferrer"&gt;deploy open models&lt;/a&gt; to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b491641592&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud Run&lt;/a&gt;. After the session, one of the developers shared something that really stuck with me: he was already using our content to create specialized &lt;a href="https://antigravity.google/docs/skills" rel="noopener noreferrer"&gt;&lt;strong&gt;Skills&lt;/strong&gt;&lt;/a&gt; to share with his entire team.&lt;/p&gt;

&lt;p&gt;I got inspired and decided it was time to dive into &lt;a href="https://antigravity.google/docs/skills" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;. During my last project, the dev-signal agent, I learned a lot about how to bring agents and AI applications to production in a robust and scalable manner. I thought, &lt;em&gt;this is a great opportunity to give my favorite coding agent, Google’s &lt;a href="https://www.antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt; (an “agent-first” IDE), those skills so that going forward, it will just do it for me!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through how I built the 13 production skills in this &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal/.agent/skills" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and the patterns behind them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Agent Skills?
&lt;/h2&gt;

&lt;p&gt;As &lt;a href="https://www.linkedin.com/in/iromin/?originalSubdomain=in" rel="noopener noreferrer"&gt;Romin Irani&lt;/a&gt; explains in &lt;a href="https://medium.com/google-cloud/tutorial-getting-started-with-antigravity-skills-864041811e0d" rel="noopener noreferrer"&gt;“Getting Started with Google Antigravity Skills”&lt;/a&gt;, skills represent a shift from monolithic context loading to &lt;strong&gt;Progressive Disclosure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Agents get “overwhelmed” when they are given too many tools all at once (a phenomenon known as “&lt;a href="https://www.linkedin.com/posts/smithakolan_your-ai-agent-is-not-bad-at-reasoning-activity-7422342915089178624-awR3?rcm=ACoAAAYeeDsBfJzKJQaDuSjRnUBmKV20OJV2olc" rel="noopener noreferrer"&gt;Tool Bloat&lt;/a&gt;”). To solve for that, Skills allow the agent to “load” specialist knowledge only when needed. When you ask an agent to “evaluate a shadow revision,” it will figure out that it needs to leverage the &lt;strong&gt;Shadow Deployer&lt;/strong&gt; skill as context for this operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workspace vs. Global Scope
&lt;/h2&gt;

&lt;p&gt;In Antigravity, you can manage these skills in two distinct ways depending on how you want to use them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workspace Scope:&lt;/strong&gt; Located in &lt;em&gt;.agent/skills/&lt;/em&gt; within your project root. These are specific to your project and can be committed to GitHub so your entire team can benefit from the same production patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global Scope:&lt;/strong&gt; Located in &lt;em&gt;~/.gemini/antigravity/skills/&lt;/em&gt;. These are your personal utilities that stay with you across every project you work on.&lt;/li&gt;
&lt;/ul&gt;
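
&lt;p&gt;Either way, a skill is just a folder containing a &lt;em&gt;SKILL.md&lt;/em&gt; file. Here is a hypothetical minimal sketch, modeled on one of the skills in the inventory later in this post (the frontmatter fields follow the common name/description convention; check the Antigravity docs for the exact schema):&lt;/p&gt;

```markdown
---
name: gcp-production-secret-handler
description: Fetch secrets in memory from Google Cloud Secret Manager instead of env files.
---

# GCP Production Secret Handler

When the user asks to handle credentials for a Cloud Run service:

1. Never write secrets to .env files or bake them into the container image.
2. Fetch them at startup with the Secret Manager client and keep them in memory.
3. Grant the service account only the Secret Manager accessor role it needs.
```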

&lt;h2&gt;
  
  
  How I built the skills
&lt;/h2&gt;

&lt;p&gt;Following the principles in &lt;a href="https://www.linkedin.com/in/petruzalek/" rel="noopener noreferrer"&gt;Daniela Petruzalek&lt;/a&gt;’s &lt;a href="https://medium.com/google-cloud/building-agent-skills-with-skill-creator-855f18e785cf" rel="noopener noreferrer"&gt;“Building Agent Skills with skill-creator”,&lt;/a&gt; I took a “methodology-first” approach. I used the existing dev-signal blog series I’ve been working on and the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;codebase&lt;/a&gt; itself as core context, asking Antigravity to identify and codify the unique skills needed to &lt;strong&gt;build a production agent on Google Cloud.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For some of the more specialized areas, I provided additional context with patterns I’d like to follow, such as the agent evaluation &lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-roadshow/2-evaluating-multi-agent-systems/evaluating-multi-agent-systems#0" rel="noopener noreferrer"&gt;codelab&lt;/a&gt; and &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents?utm_campaign=CDR_0x91b1edb5_default_b491641592&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;blog&lt;/a&gt; and the agent security &lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-roadshow/3-securing-a-multi-agent-system/securing-a-multi-agent-system#0?utm_campaign=CDR_0x91b1edb5_default_b491641592&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;codelab&lt;/a&gt;, both written by my awesome team.&lt;/p&gt;

&lt;p&gt;These 13 skills provide Antigravity (or any developer using them) with the crucial toolkit of a Google Cloud Production Engineer. I’m currently finalizing a detailed, step-by-step walkthrough of the dev-signal agent, which will be published on the &lt;a href="https://cloud.google.com/blog" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud Blog&lt;/strong&gt;&lt;/a&gt; very soon! (Follow me for future updates.)&lt;/p&gt;

&lt;p&gt;In the meantime, you don’t have to wait — the full &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal/.agent/skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt; are available for you to explore and leverage in your own projects today.&lt;/p&gt;

&lt;p&gt;Here is the full inventory of the skills:&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Production Agent
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;adk-memory-bank-initializer:&lt;/strong&gt; Long-term state logic with Vertex AI Memory Bank.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agent-containerizer:&lt;/strong&gt; Mixed-runtime Dockerfiles (Python + Node.js).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cloud-run-agent-architect:&lt;/strong&gt; Least-privilege Terraform for Cloud Run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-production-secret-handler:&lt;/strong&gt; In-memory secret fetching pattern (Secret Manager).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mcp-connector-generator:&lt;/strong&gt; Standardized MCP connection logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📊 Evaluation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-eval-engine-runner:&lt;/strong&gt; Parallel inference and reasoning trace capture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-eval-metric-configurator:&lt;/strong&gt; Setup for Grounding and Tool Use rubrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-golden-dataset-builder:&lt;/strong&gt; Tools for building datasets with reference trajectories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-shadow-deployer:&lt;/strong&gt; “Dark Canary” deployment scripts with revision tagging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-tool-trajectory-evaluator:&lt;/strong&gt; Custom Python metrics for Precision and Recall.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛡️ Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-model-armor-shield:&lt;/strong&gt; Intelligent firewall (Prompt Injection, RAI, Malicious URL filters).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-safety-gatekeeper:&lt;/strong&gt; Python integration pattern (safety_util.py) for sanitizing user inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-sdp-template-factory:&lt;/strong&gt; Terraform for Sensitive Data Protection (PII/Secret redaction).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By codifying these patterns into production skills, Antigravity can now leverage them automatically in my day-to-day development. I hope you find these as helpful as I do!&lt;/p&gt;

&lt;h2&gt;
  
  
  Pro tip: self-improving skills!
&lt;/h2&gt;

&lt;p&gt;Because these skills were AI-generated, they might not work perfectly for your specific environment on the first try. But that’s actually the best part of working with an agentic IDE. If a skill doesn’t work well for you, don’t just manually fix the code; let the coding agent figure it out. Once it finds the solution, you can ask it to update the corresponding SKILL.md with the learned workflow. This captures the correction for the future, ensuring the agent doesn’t repeat the mistake while saving you tokens and time on the next run. Think of these as living documents that actively improve as you build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to get started?&lt;/strong&gt; Clone the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and add these skills to your Workspace or Global Scope to start building your own production-ready agents. Learn more about &lt;a href="https://antigravity.google/docs/skills" rel="noopener noreferrer"&gt;Agent skills.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow me on &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt; for updates on my next blogs and videos.&lt;/p&gt;

</description>
      <category>antigravity</category>
      <category>ai</category>
      <category>googlecloud</category>
      <category>agents</category>
    </item>
    <item>
      <title>How I Turned an Ugly Spreadsheet into an AI Assisted App with Antigravity</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Wed, 18 Feb 2026 17:39:12 +0000</pubDate>
      <link>https://dev.to/googleai/how-i-turned-an-ugly-spreadsheet-into-an-ai-assisted-app-with-antigravity-3j52</link>
      <guid>https://dev.to/googleai/how-i-turned-an-ugly-spreadsheet-into-an-ai-assisted-app-with-antigravity-3j52</guid>
      <description>&lt;p&gt;&lt;strong&gt;I have a confession to make.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Up until now, I wasn’t that much into “vibe coding.” I used AI all the time for Python coding, but I never really built a whole app from scratch in a language I knew nothing about.&lt;/p&gt;

&lt;p&gt;That changed today. I encountered a really annoying problem: I had to review a massive number of talk submissions for a conference, all living in one giant spreadsheet. Staring at those tiny cells was making my eyes hurt.&lt;/p&gt;

&lt;p&gt;My initial thought was, “Hey, let’s create a really sharp UI for the submission review.” But then I thought, why stop there? Why not let AI provide me valuable inputs from social media to help me with the review itself?&lt;/p&gt;

&lt;p&gt;So, I decided to build &lt;strong&gt;TalkScout&lt;/strong&gt;. And since I wanted to test drive &lt;a href="https://antigravity.google/docs/home" rel="noopener noreferrer"&gt;Google Antigravity&lt;/a&gt; (Google’s new AI-powered coding agent), I figured this was the perfect opportunity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftvcnagk5wbmvmw2dxxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftvcnagk5wbmvmw2dxxt.png" alt="talkscout dashboard" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Talkscout Dashboard (synthetic data)&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Here is how I went from a painful CSV to a fully deployed &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; app, without writing a single line of React code myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The “Meta-Prompt” (Asking Gemini to Talk to Antigravity)
&lt;/h2&gt;

&lt;p&gt;I didn’t start by coding; I started by chatting. I used &lt;strong&gt;meta-prompting&lt;/strong&gt; to get started.&lt;/p&gt;

&lt;p&gt;So, what is meta-prompting, you may ask? It’s actually when you go to Gemini 3 and ask it to write the prompt for the coding agent.&lt;/p&gt;

&lt;p&gt;I explained my problem to &lt;strong&gt;Gemini 3&lt;/strong&gt; in simple words. Gemini 3 acted as my architect. It turned my “brain dump” requirements into a technical spec, defining the component structure and data model. I didn’t have to guess the right words; I just pasted that polished spec into Antigravity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Ditching the Spreadsheet for a Dashboard
&lt;/h2&gt;

&lt;p&gt;With that prompt, Antigravity built the app of my dreams. It allowed me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload the CSV with all the conference talks.&lt;/li&gt;
&lt;li&gt;Get a dashboard showing the status of each talk.&lt;/li&gt;
&lt;li&gt;See a beautiful, high-contrast UI to review abstracts and demo plans without squinting at cells.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuv05d3jhgbptocmquud4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuv05d3jhgbptocmquud4.png" alt="TalkScout submission review page with high contrast UI" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;TalkScout submission review page with high contrast UI&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;&lt;strong&gt;The “Vibe” Fix:&lt;/strong&gt; It wasn’t all smooth sailing — I actually hit a nasty React hydration error. This can take hours to debug, especially if you’re not a frontend developer… But I simply provided the error message to Antigravity and the coding agent pinpointed the mismatch in the DOM and fixed it in minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Integrating Grounded Intelligence
&lt;/h2&gt;

&lt;p&gt;I didn’t just want a UI; I wanted to overcome my own bias. How do I know if a niche topic is actually hot?&lt;/p&gt;

&lt;p&gt;I added a button to get an &lt;strong&gt;AI Assessment&lt;/strong&gt;. But I didn’t want hallucinations. I used &lt;strong&gt;Google Search Grounding&lt;/strong&gt; so the AI could search through Reddit, X (Twitter), and LinkedIn for real-world developer signals. That gave me inputs grounded in current developer audience mindshare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftck7aytx9ecgnlrifq08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftck7aytx9ecgnlrifq08.png" alt="TalkScout submission review page with AI social media analysis" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;TalkScout submission review page with AI social media analysis&lt;/small&gt;&lt;/center&gt;

&lt;h2&gt;
  
  
  Step 4: Calibrating the “Strict” Reviewer
&lt;/h2&gt;

&lt;p&gt;Initially, the AI was way too nice. It was giving high scores to anything with trendy keywords.&lt;/p&gt;

&lt;p&gt;I used what’s called &lt;strong&gt;few-shot prompting&lt;/strong&gt; to calibrate it. I gave examples of my scores vs. its scores and introduced what I call the &lt;strong&gt;“Marketing Fluff Penalty”&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a submission reads like a documentation/marketing page? Points docked.&lt;/li&gt;
&lt;li&gt;If the submission was way too short? We capped the score at a hard 2.&lt;/li&gt;
&lt;li&gt;If it includes war stories and actual learnings? Rating boosted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a few examples, it became more calibrated to my taste.&lt;/p&gt;
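
&lt;p&gt;If you prefer code to prose, the rubric boils down to something like this (a hypothetical post-processing sketch with placeholder keyword lists; in TalkScout the calibration actually lives in the few-shot prompt itself):&lt;/p&gt;

```python
def calibrated_score(raw_score: int, text: str) -> int:
    """Apply the 'Marketing Fluff Penalty' rubric on top of a raw 1-5 score.

    The keyword lists below are illustrative placeholders, not the real prompt.
    """
    score = raw_score
    lowered = text.lower()
    # Way too short? Cap the score at a hard 2.
    if 50 > len(text.split()):
        score = min(score, 2)
    # Reads like a documentation/marketing page? Points docked.
    if any(w in lowered for w in ("revolutionary", "game-changing", "seamless")):
        score -= 1
    # War stories and actual learnings? Rating boosted.
    if any(w in lowered for w in ("we learned", "postmortem", "incident")):
        score += 1
    return max(1, min(5, score))

print(calibrated_score(5, "Our revolutionary, seamless, game-changing platform"))  # 1
```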

&lt;h2&gt;
  
  
  Step 5: The Pivot to Batch Mode
&lt;/h2&gt;

&lt;p&gt;I realized it was taking me too long to ask the AI to evaluate each talk individually while I reviewed it.&lt;/p&gt;

&lt;p&gt;So, I asked Antigravity to refactor the backend for &lt;strong&gt;Batch Mode&lt;/strong&gt;. Now, TalkScout processes the entire submission pool in the background. By the time I grab a coffee, the “AI Draft” column is full of insights, allowing me to focus only on the final decisions.&lt;/p&gt;
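
&lt;p&gt;The fan-out itself is a standard pattern. A minimal sketch of what the refactor amounts to (the &lt;em&gt;assess&lt;/em&gt; stub here is hypothetical; the real version calls the Gemini API per submission):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def assess(submission: dict) -> dict:
    """Stand-in for the real per-talk Gemini call (hypothetical stub)."""
    return {**submission, "ai_draft": f"Assessment for {submission['title']}"}

def batch_assess(submissions: list, workers: int = 8) -> list:
    # Fan the whole submission pool out to a worker pool so the
    # "AI Draft" column fills in while the reviewer does other things.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(assess, submissions))  # map preserves input order

drafts = batch_assess([{"title": "Talk A"}, {"title": "Talk B"}])
print(drafts[0]["ai_draft"])  # Assessment for Talk A
```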

&lt;h2&gt;
  
  
  Step 6: Sharing the Goodness (Deploy to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;)
&lt;/h2&gt;

&lt;p&gt;TalkScout was working great for me, but I thought, “It would be great to share this with the other reviewers.”&lt;/p&gt;

&lt;p&gt;This is where Antigravity really showed off. I simply asked it to deploy the app. It automatically recognized my Google Cloud Project ID, handled the containerization, generated the exact deployment commands, and deployed it to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One simple ask, and minutes later, I had a URL to share with the team.&lt;/p&gt;

&lt;h2&gt;
  
  
  It Was Pretty Fun!
&lt;/h2&gt;

&lt;p&gt;It was pretty fun to actually solve a real problem I had using Antigravity and vibe coding. I built a tool that handles ingestion, offers a distraction-free rating interface, and surfaces valuable inputs for my reviews.&lt;/p&gt;

&lt;p&gt;I would love to hear from you all - have you recently solved a problem using vibe coding?&lt;/p&gt;

&lt;p&gt;If you haven’t already - try playing around with &lt;a href="https://antigravity.google/docs/home" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt; and easily deploy your apps to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>antigravity</category>
      <category>ai</category>
      <category>gemini</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Decoding high-bandwidth memory: A practical guide to GPU memory for fine-tuning AI models</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 15 Jan 2026 15:27:00 +0000</pubDate>
      <link>https://dev.to/googleai/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models-56af</link>
      <guid>https://dev.to/googleai/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models-56af</guid>
      <description>&lt;p&gt;We've all been there. You've meticulously prepared your dataset and written your training script. You hit &lt;strong&gt;run&lt;/strong&gt;, and your excitement builds, only to be crushed by the infamous error: CUDA out of memory.&lt;/p&gt;

&lt;p&gt;This is one of the most common roadblocks in AI development. Your GPU's &lt;a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory" rel="noopener noreferrer"&gt;High Bandwidth Memory (HBM)&lt;/a&gt; is the high-speed memory that holds everything needed for computation, and running out of it is a hard stop. But how do you know how much you need?&lt;/p&gt;

&lt;p&gt;To build a clear foundation, we'll start by breaking down the HBM consumers on a single GPU and present key strategies to reduce that consumption. Later, we'll explore advanced multi-GPU strategies like data and &lt;a href="https://huggingface.co/docs/transformers/v4.13.0/en/parallelism" rel="noopener noreferrer"&gt;model parallelism&lt;/a&gt; that can help relieve memory pressure and scale your training in the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding HBM: What's using all the memory?
&lt;/h2&gt;

&lt;p&gt;When you fine-tune a model, your HBM is primarily consumed by three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.webopedia.com/technology/llm-tokens-weights-parameters/#:~:text=in%20various%20contexts.-,What%20are%20LLM%20Weights?,or%20generate%20coherent%2C%20meaningful%20responses." rel="noopener noreferrer"&gt;Model Weights&lt;/a&gt;:&lt;/strong&gt; This is the most straightforward. It's the storage space required for the model's parameters—the "brain" that it uses to make predictions. A 7-billion parameter model loaded in 16-bit precision will take up roughly 14 GB before you even process a single piece of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://eureka.patsnap.com/article/what-is-the-optimizer-state-in-deep-learning-training" rel="noopener noreferrer"&gt;Optimizer States&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Gradient_descent" rel="noopener noreferrer"&gt;Gradients&lt;/a&gt;:&lt;/strong&gt; This is the overhead that's required for learning. To update the model's weights, the training process needs to calculate gradients (the direction of learning) and the &lt;a href="https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adam_Deep_Learning_Optimizer" rel="noopener noreferrer"&gt;optimizer&lt;/a&gt; (like the popular &lt;a href="https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html" rel="noopener noreferrer"&gt;AdamW&lt;/a&gt;) needs to store its own data to guide the training. In full fine-tuning, this can be the largest consumer of HBM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Activation_function" rel="noopener noreferrer"&gt;Activations&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Online_machine_learning#Batch_learning" rel="noopener noreferrer"&gt;Batch Data&lt;/a&gt;:&lt;/strong&gt; This is the most dynamic part. When your data (images, text, etc.) flows through the model's layers, the intermediate calculations, or activations, are stored in HBM. The memory needed here is directly proportional to your batch size. A larger batch size means more activations are stored simultaneously, which leads to faster training but much higher memory usage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These calculations are theoretical minimums. Real-world frameworks add up to 30% overhead due to &lt;a href="https://arxiv.org/abs/1910.02054" rel="noopener noreferrer"&gt;temporary buffers, kernel launches, and memory fragmentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although it's impossible to get a perfect number without experimentation, you can estimate your HBM needs with this general formula:&lt;br&gt;
&lt;em&gt;&lt;center&gt;Total HBM ≈ (Model Size) + (Optimizer States) + (Gradients) + (Activations)&lt;/center&gt;&lt;/em&gt;&lt;br&gt;
 &lt;br&gt;
&lt;strong&gt;Further reading:&lt;/strong&gt; See this excellent JAX e-book that covers &lt;a href="https://jax-ml.github.io/scaling-book/gpus/" rel="noopener noreferrer"&gt;these topics&lt;/a&gt; in great detail and even has some &lt;a href="https://jax-ml.github.io/scaling-book/gpus/#quiz-5-llm-rooflines" rel="noopener noreferrer"&gt;"try it out yourself" test questions&lt;/a&gt;.&lt;/p&gt;
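
&lt;p&gt;As a quick sanity check, the weights term of the formula translates directly to code (a minimal sketch, treating 1 GB as 10&lt;sup&gt;9&lt;/sup&gt; bytes and ignoring the framework overhead noted above):&lt;/p&gt;

```python
def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

# A 7-billion-parameter model in 16-bit (2-byte) precision:
print(weights_gb(7e9))  # 14.0
```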
&lt;h2&gt;
  
  
  Example: Why full fine-tuning is so demanding
&lt;/h2&gt;

&lt;p&gt;To see why running out of memory is such a common problem, let's walk through a real-world example that I recently worked on: fine-tuning the &lt;a href="https://deepmind.google/models/gemma/medgemma/" rel="noopener noreferrer"&gt;medgemma-4b-it model&lt;/a&gt;, which has 4 billion parameters. Our &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/a-step-by-step-guide-to-fine-tuning-medgemma-for-breast-tumor-classification" rel="noopener noreferrer"&gt;script&lt;/a&gt; loads it in bfloat16 precision (2 bytes per parameter).&lt;/p&gt;

&lt;p&gt;First, let's calculate the static HBM footprint. This is the memory that's required just to load the model and prepare it for training, before you've even processed a single piece of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Model Size:&lt;/strong&gt; The memory that's needed to simply hold the model on the GPU.&lt;/p&gt;

&lt;center&gt;4 billion parameters × 2 bytes/parameter = 8 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Gradients and Optimizer States:&lt;/strong&gt; The overhead for training every parameter with the AdamW optimizer.&lt;/p&gt;

&lt;center&gt;Gradients: 4 billion parameters × 2 bytes/parameter = 8 GB&lt;/center&gt;

&lt;center&gt;Optimizer States (AdamW): 2 × 4 billion parameters × 2 bytes/parameter = 16 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; While AdamW is a popular optimizer, other optimizers, such as Adafactor and Lion, have different memory footprints.&lt;/p&gt;

&lt;p&gt;Adding these together gives us our baseline HBM cost for a full fine-tuning attempt:&lt;/p&gt;

&lt;center&gt;8 GB (Model) + 8 GB (Gradients) + 16 GB (Optimizer) = 32 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;This 32 GB is the baseline just to start the training process. On top of this, the GPU needs &lt;strong&gt;additional memory for activations&lt;/strong&gt;, which is a &lt;em&gt;dynamic&lt;/em&gt; cost that grows with your batch size and input data size. This is why full fine-tuning of large models is so demanding and often reserved for the most powerful hardware.&lt;/p&gt;
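
&lt;p&gt;The walkthrough above is easy to reproduce in code, so you can plug in your own model size (same simplified 1 GB = 10&lt;sup&gt;9&lt;/sup&gt; bytes convention; activations come on top of this baseline):&lt;/p&gt;

```python
PARAMS = 4e9         # medgemma-4b-it
BYTES_PER_PARAM = 2  # bfloat16

model_gb = PARAMS * BYTES_PER_PARAM / 1e9      # 8.0 GB of weights
grads_gb = PARAMS * BYTES_PER_PARAM / 1e9      # 8.0 GB of gradients
optim_gb = 2 * PARAMS * BYTES_PER_PARAM / 1e9  # 16.0 GB (AdamW keeps two states per parameter)

static_gb = model_gb + grads_gb + optim_gb
print(static_gb)  # 32.0
```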
&lt;h2&gt;
  
  
  Key strategies to reduce HBM consumption
&lt;/h2&gt;

&lt;p&gt;The HBM requirement for a full fine-tune can seem impossibly high. But several powerful techniques can reduce memory consumption, making it feasible to train large models on consumer-grade or entry-level professional GPUs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Parameter-Efficient Fine-Tuning (PEFT) with LoRA
&lt;/h3&gt;

&lt;p&gt;Instead of training all the billions of parameters in a model, &lt;a href="https://huggingface.co/docs/peft/en/index" rel="noopener noreferrer"&gt;Parameter-Efficient Fine-Tuning (PEFT)&lt;/a&gt; methods focus on training only a small subset of parameters. The most popular of these is &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA (Low-Rank Adaptation)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/lora-qlora?utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;LoRA&lt;/a&gt; works by freezing &lt;strong&gt;the original model's weights and injecting a tiny number of new, trainable &lt;em&gt;adapter&lt;/em&gt; layers&lt;/strong&gt; into the model architecture. This means the memory-hungry gradients and optimizer states are only needed for these few million new parameters, not the full 4 billion.&lt;/p&gt;
&lt;h4&gt;
  
  
  The math behind LoRA's memory savings
&lt;/h4&gt;

&lt;p&gt;LoRA doesn't remove the base model from your GPU. The full 8 GB of the original model's weights are still loaded and taking up HBM. They're just frozen, which means that the GPU isn't training them. All of the memory savings come from the fact that you no longer need to store the huge gradients and optimizer states for that massive, frozen part of the model.&lt;/p&gt;

&lt;p&gt;Let's recalculate the static HBM footprint with LoRA, assuming it adds 20 million trainable parameters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Model Size (unchanged):&lt;/strong&gt; The base model is still loaded.&lt;/p&gt;

&lt;center&gt;4 billion parameters × 2 bytes/parameter = 8 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. LoRA Gradients &amp;amp; Optimizer States:&lt;/strong&gt; We now only need overhead for the tiny set of new parameters.&lt;/p&gt;

&lt;center&gt;Gradients: 20 million parameters × 2 bytes/parameter = 40 MB&lt;/center&gt;

&lt;center&gt;
Optimizer States: 2 × 20 million parameters × 2 bytes/parameter = 80 MB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The new static HBM footprint is now:&lt;/p&gt;

&lt;center&gt;8 GB (Model) + 40 MB (Gradients) + 80 MB (Optimizer) ≈ 8.12 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The training overhead has shrunk from 24 GB to just 120 MB. Your new baseline memory requirement is now just over 8 GB. This lower baseline memory requirement leaves much more room for the dynamic memory that's needed for activations, which lets you use a reasonable batch size on a common 16 GB or 24 GB GPU without running out of memory.&lt;/p&gt;
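
&lt;p&gt;If it helps, the whole calculation fits in a few lines of Python. This is just a sketch of the arithmetic above (the function name and decimal-GB convention are mine), not a profiler:&lt;/p&gt;

```python
def static_hbm_gb(params, trainable_params, bytes_per_param=2, optimizer_states=2):
    """Static HBM footprint in GB: weights + gradients + optimizer states.

    Assumes bf16 weights and gradients (2 bytes each) and an optimizer that
    keeps `optimizer_states` extra bf16 copies per trainable parameter.
    Activations are excluded: they scale with batch size and sequence length.
    """
    weights = params * bytes_per_param
    gradients = trainable_params * bytes_per_param
    optimizer = optimizer_states * trainable_params * bytes_per_param
    return (weights + gradients + optimizer) / 1e9  # decimal GB, as above

print(static_hbm_gb(4e9, trainable_params=4e9))    # full fine-tune: 32.0 GB
print(static_hbm_gb(4e9, trainable_params=20e6))   # LoRA adapters: 8.12 GB
```

&lt;p&gt;Plug in your own parameter counts to estimate a static baseline for other models and adapter sizes.&lt;/p&gt;
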
&lt;h3&gt;
  
  
  Model quantization
&lt;/h3&gt;

&lt;p&gt;Besides training fewer parameters, we can also shrink the ones that we have by using &lt;a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization" rel="noopener noreferrer"&gt;quantization&lt;/a&gt;, which involves reducing the &lt;a href="https://arxiv.org/html/2410.13857v1" rel="noopener noreferrer"&gt;numerical precision&lt;/a&gt; of the model's weights. The standard precision for modern training is &lt;a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format" rel="noopener noreferrer"&gt;bfloat16&lt;/a&gt; because it offers the dynamic range of float32 with half the memory footprint. But we can reduce HBM usage further by converting weights to lower-precision integer formats like int8 or int4.&lt;/p&gt;

&lt;p&gt;Using lower-precision integer formats has a significant impact on HBM when compared to the standard bfloat16 baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;bfloat16 (standard):&lt;/strong&gt; The baseline size (e.g., a 7B model requires &lt;strong&gt;~14 GB&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8-bit precision:&lt;/strong&gt; Halves the model size (e.g., 14 GB becomes &lt;strong&gt;~7 GB&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4-bit precision:&lt;/strong&gt; Reduces the model size by a factor of 4 (e.g., 14 GB becomes &lt;strong&gt;~3.5 GB&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reduction in size lets you fit much larger models into memory with minimal degradation in performance.&lt;/p&gt;
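
&lt;p&gt;Those sizes fall straight out of bits-per-parameter arithmetic. A quick sketch (the helper name is mine, using decimal GB as in the rest of this post):&lt;/p&gt;

```python
def weight_storage_gb(params, bits):
    """Approximate weight storage for `params` parameters at `bits` per weight."""
    return params * bits / 8 / 1e9  # 8 bits per byte, decimal GB

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_storage_gb(7e9, bits):.1f} GB")
# 7B model at 16-bit: 14.0 GB
# 7B model at 8-bit: 7.0 GB
# 7B model at 4-bit: 3.5 GB
```
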


&lt;div class="crayons-card c-embed"&gt;

  

&lt;p&gt;&lt;strong&gt;A word of warning from experience:&lt;/strong&gt;&lt;br&gt;
When I started experimenting in this area, my first attempt to load the model using the common float16 data type failed spectacularly: a quick check revealed that every internal value had collapsed into NaN (Not a Number).&lt;/p&gt;

&lt;p&gt;The culprit was a classic &lt;a href="https://en.wikipedia.org/wiki/Integer_overflow" rel="noopener noreferrer"&gt;numerical overflow&lt;/a&gt;. The float16 data type has a tiny numerical range: it can't represent any number larger than 65,504. During training, intermediate values can easily exceed this limit, causing an overflow that produces infinities, which quickly cascade into NaN. The fix was a simple one-line change to bfloat16, whose much larger numerical range prevents these overflows and keeps training stable. For fine-tuning large models, always prefer bfloat16 for stability.&lt;/p&gt;
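
&lt;p&gt;You can derive both limits from first principles: the largest finite value of a binary floating-point format with m mantissa bits and maximum exponent emax is (2 - 2^-m) * 2^emax. A quick pure-Python check (the helper name is mine) recovers both numbers:&lt;/p&gt;

```python
def max_finite(mantissa_bits, max_exponent):
    """Largest finite value of a float format: (2 - 2**-m) * 2**emax."""
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent

fp16_max = max_finite(10, 15)    # float16: 10 mantissa bits, emax = 15
bf16_max = max_finite(7, 127)    # bfloat16: 7 mantissa bits, emax = 127

print(fp16_max)  # 65504.0 -- anything bigger overflows to inf, then NaN
print(bf16_max)  # ~3.39e38 -- essentially the same range as float32
```
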


&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2305.14314" rel="noopener noreferrer"&gt;Combining LoRA and Quantization:&lt;/a&gt;&lt;/strong&gt; These techniques work best together. Quantized LoRA (QLoRA) is a method that stores the massive base model in a highly efficient 4-bit format (specifically NF4 or NormalFloat 4), while adding small, trainable LoRA adapters in bfloat16. During the training process, the 4-bit weights are dequantized to bfloat16 for computation. Dequantizing in process lets you fine-tune very large models on a single GPU with the memory savings of 4-bit storage and the mathematical stability of 16-bit training.&lt;/p&gt;

&lt;h3&gt;
  
  
  FlashAttention: An algorithmic speed boost
&lt;/h3&gt;

&lt;p&gt;Finally, &lt;a href="https://arxiv.org/abs/2205.14135" rel="noopener noreferrer"&gt;FlashAttention&lt;/a&gt; is a foundational algorithmic optimization that significantly reduces HBM usage and speeds up training on both single and multi-GPU setups. The attention mechanism in transformers is a primary memory bottleneck because it requires storing a large, intermediate &lt;a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29" rel="noopener noreferrer"&gt;attention matrix&lt;/a&gt;. FlashAttention cleverly reorders the computation to avoid storing this full matrix in memory, leading to substantial memory savings and faster execution.&lt;/p&gt;

&lt;p&gt;Best of all, enabling FlashAttention is often as simple as a one-line change. In the MedGemma fine-tuning script, this was done by setting the value &lt;code&gt;attn_implementation="sdpa"&lt;/code&gt;, which can automatically use more efficient backends like FlashAttention if the hardware supports it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling beyond a single GPU: Advanced strategies
&lt;/h2&gt;

&lt;p&gt;Techniques like LoRA and quantization are useful for lowering HBM needs on a single GPU. But to train truly massive models or to really speed up the process, you'll eventually need to scale out to multiple GPUs. Here are some of the key strategies that can be used to distribute the load and overcome memory limitations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data parallelism
&lt;/h3&gt;

&lt;p&gt;Data parallelism is the most common and intuitive approach to scaling. In a Distributed Data Parallel (DDP) setup, the entire model is replicated on each GPU. The key is that the global batch of training data is split, with each GPU processing its own mini-batch concurrently. After each forward and backward pass, the gradients from each GPU are averaged together so that all of the model replicas learn from the entire dataset and stay in sync. This method is excellent for &lt;strong&gt;speeding up training&lt;/strong&gt;, but it &lt;strong&gt;doesn't reduce the HBM&lt;/strong&gt; required to hold the model itself, because every GPU needs a full copy.&lt;/p&gt;
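
&lt;p&gt;The gradient-averaging step is just arithmetic, and a toy example shows why it is equivalent to computing the gradient on the full global batch (all names here are hypothetical, and the plain average relies on equal-sized shards):&lt;/p&gt;

```python
# Toy data-parallel step for a 1-parameter model y = w * x with MSE loss.
# Each "GPU" computes the gradient on its own mini-batch; averaging those
# gradients reproduces the gradient of the full global batch.
def grad(w, batch):
    # d/dw of mean((w*x - y)**2) is mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

full_grad = grad(w, data)                        # single-GPU reference
shards = [data[:2], data[2:]]                    # split batch across 2 "GPUs"
ddp_grad = sum(grad(w, s) for s in shards) / len(shards)
print(full_grad, ddp_grad)                       # the two values match
```
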

&lt;h3&gt;
  
  
  Model parallelism
&lt;/h3&gt;

&lt;p&gt;When a model is too large to fit into the memory of a single GPU, you must use &lt;a href="https://en.wikipedia.org/wiki/Data_parallelism#Data_parallelism_vs._model_parallelism" rel="noopener noreferrer"&gt;model parallelism&lt;/a&gt;. Instead of replicating the model, this strategy &lt;strong&gt;splits the model&lt;/strong&gt; across multiple GPUs. There are two primary ways to do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism" rel="noopener noreferrer"&gt;Tensor parallelism&lt;/a&gt;:&lt;/strong&gt; This method splits a single large operation (like a massive weight matrix in a transformer layer) across several GPUs. Each GPU computes its part of the operation, and the results are combined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.pytorch.org/docs/stable/distributed.pipelining.html" rel="noopener noreferrer"&gt;Pipeline parallelism&lt;/a&gt;:&lt;/strong&gt; This technique places different layers of the model onto different GPUs in a sequence. The data flows through the first set of layers on GPU 1, then the output is passed to GPU 2 for the next set of layers, and so on, like an assembly line.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These strategies are more complex to implement than data parallelism, but they're essential for models that are simply too big for one device.&lt;/p&gt;
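
&lt;p&gt;The assembly-line idea behind pipeline parallelism can be sketched in a few lines of plain Python, with a list of functions standing in for layers assigned to different GPUs (everything here is a toy stand-in, not a real distributed runtime):&lt;/p&gt;

```python
# Toy pipeline parallelism: a hypothetical 4-"layer" model is partitioned
# into two stages, each of which would live on its own GPU. Data flows
# through stage 0, then its output is handed to stage 1, like an assembly line.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def run(stage, x):
    for layer in stage:
        x = layer(x)
    return x

stage_gpu0, stage_gpu1 = layers[:2], layers[2:]     # split the model in half

x = 5
full_model_out = run(layers, x)                     # single-device reference
pipeline_out = run(stage_gpu1, run(stage_gpu0, x))  # GPU0 -> GPU1
print(full_model_out, pipeline_out)                 # identical results
```
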

&lt;h3&gt;
  
  
  Fully Sharded Data Parallelism (FSDP)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html" rel="noopener noreferrer"&gt;FSDP&lt;/a&gt; is a powerful and efficient hybrid strategy that combines the ideas of &lt;strong&gt;data parallelism&lt;/strong&gt; and &lt;strong&gt;model parallelism&lt;/strong&gt;. Unlike standard data parallelism where each GPU holds a full copy of the model, optimizer states, and gradients, FSDP shards (or splits) all of these components across the GPUs. Each GPU only materializes the full parameters for the &lt;strong&gt;specific layer&lt;/strong&gt; that it's computing at that moment, &lt;strong&gt;dramatically reducing the peak HBM&lt;/strong&gt; usage per device. FSDP makes it possible to train enormous models on a cluster of smaller GPUs.&lt;/p&gt;

&lt;p&gt;By combining these hardware and software strategies, you can &lt;strong&gt;scale your fine-tuning jobs&lt;/strong&gt; from a single GPU to a &lt;strong&gt;powerful, distributed cluster&lt;/strong&gt; capable of handling even the most demanding AI models.&lt;/p&gt;

&lt;h2&gt;
  
  
  HBM sizing guide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;HBM&lt;/th&gt;
&lt;th&gt;Use case and explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;Sufficient for basic inference or fine-tuning with techniques like LoRA using a very small batch size (e.g., 1-2). Expect slower training times at this level.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;The recommended starting point for a good experience with 4-7B parameter models. This capacity allows for a more effective batch size (e.g., 8-16) when using LoRA, providing a great balance of training speed and cost.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40+ GB&lt;/td&gt;
&lt;td&gt;Necessary for maximizing training speed with large batch sizes or for working with larger models (in the 20B+ parameter range) now or in the future.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Encountering the CUDA out of memory error provides an important lesson in the trade-offs between model size, training techniques, and batch size. By understanding what consumes your HBM, you can make smarter decisions and keep your projects running smoothly.&lt;/p&gt;

&lt;p&gt;I hope that this guide has demystified the CUDA out of memory error. When you're ready to take the next step, Google Cloud has the tools to accelerate your AI development.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore &lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu?utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GPU configurations for your Cloud Run services&lt;/a&gt; and best practices for running &lt;a href="https://cloud.google.com/run/docs/configuring/jobs/gpu-best-practices?hl=en&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run jobs with GPU&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For maximum control: Spin up a &lt;a href="https://cloud.google.com/products/compute" rel="noopener noreferrer"&gt;Compute Engine&lt;/a&gt; instance with the latest NVIDIA H100 or A100 Tensor Core GPUs and take full control of your environment.&lt;/li&gt;
&lt;li&gt;Looking to optimize your model hosting infrastructure? Take a look at &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/vllm-performance-tuning-the-ultimate-guide-to-xpu-inference-configuration?utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;The Ultimate Guide to xPU Inference Configuration&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For a deeper dive into scaling your model, check out &lt;a href="https://jax-ml.github.io/scaling-book" rel="noopener noreferrer"&gt;How to Scale Your Model&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;New to Google Cloud? Get started with the $300 free credit to find the perfect solution for your next project.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Special thanks to Jason Monden and Sayce Falk from the AI compute team for their helpful review and feedback on this post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gpu</category>
      <category>performance</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Agent Factory Recap: Can you do my shopping?</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Fri, 19 Dec 2025 19:44:58 +0000</pubDate>
      <link>https://dev.to/googleai/agent-factory-recap-can-you-do-my-shopping-5f8k</link>
      <guid>https://dev.to/googleai/agent-factory-recap-can-you-do-my-shopping-5f8k</guid>
      <description>&lt;p&gt;In episode #8 of &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs" rel="noopener noreferrer"&gt;The Agent Factory&lt;/a&gt;, Ivan Nardini and I are joined by Prateek Dudeja, product manager from the Agent Payment Protocol Team, to dive into one of the biggest hurdles for &lt;a href="https://cloud.google.com/discover/what-are-ai-agents?e=48754805&amp;amp;hl=en&amp;amp;utm_campaign=CDR_0x6e136736_awareness_b446653415&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt; in eccomerce: trust, especially when it comes to money.&lt;/p&gt;

&lt;p&gt;This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Agent Payment Protocol
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=103s" rel="noopener noreferrer"&gt;01:43&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What if an agent could buy concert tickets for you the moment they go on sale? You don't want to miss out! Maybe you want two tickets, and you don't want to spend more than $200. You definitely want to sit in a section with a great view of the stage. To have an agent act as your ticket buyer, you would have to trust that agent with all facets of your request and your credit card. How can you be sure that the agent won't buy 200 tickets or that it won't charge you for a lifetime supply of rubber duckies?&lt;/p&gt;

&lt;p&gt;The potential for a messy outcome with this concert ticket request provides insight into a "&lt;strong&gt;Crisis of Trust&lt;/strong&gt;" that can hold back agentic commerce. The good news is there's a way to move forward and build trust. &lt;/p&gt;

&lt;p&gt;To solve the "Crisis of Trust," Google introduced the &lt;a href="https://github.com/google-agentic-commerce/AP2" rel="noopener noreferrer"&gt;Agent Payment Protocol (AP2)&lt;/a&gt;, a new open standard. It's not a new payment system; it’s a "&lt;strong&gt;trust layer&lt;/strong&gt;" that sits on top of existing infrastructure. AP2 is designed to create a common, secure language for agents to conduct commerce, using role-based architecture and verifiable credentials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i6f0zm7dqgtryjkgpf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i6f0zm7dqgtryjkgpf3.png" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Payments and the Current Payment System
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=149s" rel="noopener noreferrer"&gt;02:29&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The current payment system was built for humans using trusted interfaces like browsers, not for autonomous agents, resulting in three main challenges for agents: &lt;strong&gt;authorization, agent error, and accountability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvok8osf319hbwwru4p8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvok8osf319hbwwru4p8r.png" width="800" height="566"&gt;&lt;/a&gt;&lt;br&gt;
The &lt;strong&gt;Agent Payment Protocol&lt;/strong&gt; addresses these challenges by helping agents communicate securely with merchants and payment partners. The Agent Payment Protocol is available today as an extension for the &lt;a href="https://a2a-protocol.org/" rel="noopener noreferrer"&gt;A2A (Agent2Agent) protocol&lt;/a&gt; and relies on agents using the &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive into the Agent Payment Protocol
&lt;/h2&gt;

&lt;p&gt;Learn more about how this protocol works, including concepts and flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Role-Based Ecosystem
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=273s" rel="noopener noreferrer"&gt;04:33&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The protocol is built on a "separation of concerns." Your agent doesn't have to do everything. There are specialized roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shopping Agent&lt;/strong&gt;: The AI agent you build, great at finding products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merchant Endpoint&lt;/strong&gt;: The seller's API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Provider&lt;/strong&gt;: A secure digital wallet (like PayPal, Google Pay, etc.) that manages payment details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merchant Payment Processor&lt;/strong&gt;: The entity that constructs the final authorization message for the payment networks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0snpazgnllpzauxu1di0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0snpazgnllpzauxu1di0.png" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: &lt;em&gt;Your shopping agent never touches the raw credit card number. It doesn't need to be PCI compliant because it delegates the payment to the specialized, secure providers.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifiable Credentials (VCs)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=375s" rel="noopener noreferrer"&gt;06:15&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The "handshakes" between these roles in the Agent Payment Protocol ecosystem are secured by Verifiable Credentials (VCs). Think of credentials as protocolized, cryptographically signed digital receipts that prove what was agreed upon.&lt;/p&gt;

&lt;p&gt;There are three types of verifiable credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cart Mandate&lt;/strong&gt;: For "human-present" scenarios. The user reviews a final cart and cryptographically signs it as proof of approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent Mandate&lt;/strong&gt;: For "human-not-present" scenarios (like the concert ticket example). The user signs an intent (e.g., "buy tickets under $200"), giving the agent authority to act within those guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Mandate&lt;/strong&gt;: Provides clear visibility to payment networks and banks that an AI agent was involved in the transaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21rb8f2ntkuaeo3nhe0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21rb8f2ntkuaeo3nhe0o.png" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Contractual Conversational Model
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=483s" rel="noopener noreferrer"&gt;08:03&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Agent Payment Protocol process creates a "Contractual Conversational Model," moving beyond simple API calls to a flow built on verifiable proof.&lt;/p&gt;

&lt;p&gt;To understand this flow, we'll walk through a &lt;strong&gt;human-present scenario&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Delegation&lt;/strong&gt;: You tell your agent, "Buy two concert tickets."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery &amp;amp; Negotiation&lt;/strong&gt;: The agent contacts the merchant's endpoint to prepare the cart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finalize Cart&lt;/strong&gt;: The agent reaches out to your Credential Provider (e.g., your digital wallet). You select the payment method. The agent only gets a reference (like the last 4 digits), never the full credential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization with Mandates&lt;/strong&gt;: The agent shows you the finalized cart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You cryptographically sign the Cart Mandate&lt;/strong&gt;. This is the non-repudiable proof, the "contract."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purchase&lt;/strong&gt;: The agent sends this signed mandate to the merchant. The merchant can now trust the purchase mandate is from you. The merchant's payment processor uses the mandate to securely get the payment token from the credential provider and complete the transaction.&lt;/li&gt;
&lt;/ol&gt;
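
&lt;p&gt;To make the "sign, then verify" step concrete, here is an illustrative toy in Python. Real AP2 mandates are verifiable credentials signed with the user's own key pair; the HMAC below is only a stand-in for that signature, and every name in this sketch is hypothetical:&lt;/p&gt;

```python
import hashlib
import hmac
import json

# Illustrative only: an HMAC over the canonicalized cart stands in for the
# user's cryptographic signature, so the merchant can detect any tampering.
USER_KEY = b"user-device-secret"   # hypothetical key held by the user's device

def sign_cart_mandate(cart):
    payload = json.dumps(cart, sort_keys=True).encode()
    signature = hmac.new(USER_KEY, payload, hashlib.sha256).hexdigest()
    return {"cart": cart, "signature": signature}

def merchant_verifies(mandate):
    payload = json.dumps(mandate["cart"], sort_keys=True).encode()
    expected = hmac.new(USER_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, mandate["signature"])

cart = {"item": "concert ticket", "qty": 2, "max_total_usd": 200,
        "payment_ref": "visa **1234"}   # a reference only, never the full PAN
mandate = sign_cart_mandate(cart)
print(merchant_verifies(mandate))       # the cart is exactly as approved

mandate["cart"]["qty"] = 200            # a tampered cart...
print(merchant_verifies(mandate))       # ...fails verification
```

&lt;p&gt;The point of the sketch is the non-repudiation property: the signature binds the user's approval to one exact cart, so neither an over-eager agent nor a merchant can change it after the fact.&lt;/p&gt;
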

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd4ve1quoy9x6b3zraod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd4ve1quoy9x6b3zraod.png" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This flow all hinges on trust. In the short term, this trust is built using &lt;strong&gt;manual allow lists&lt;/strong&gt; of approved agents and merchants. In the long term, the plan is to use open web standards like HTTPS and DNS ownership to verify identities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q&amp;amp;A with Prateek Dudeja
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=787s" rel="noopener noreferrer"&gt;13:07&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the concepts explained, the discussion moved to a Q&amp;amp;A with Prateek.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a New Protocol for Payments?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=810s" rel="noopener noreferrer"&gt;13:30&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Prateek gave a great analogy: HTTPS is a baseline protocol for browsing. Signing in requires stronger authentication. Making a &lt;strong&gt;payment&lt;/strong&gt; requires an even higher level of trust. AP2 provides that "payments-grade security" on top of baseline protocols like A2A and MCP, ensuring the transaction is high-trust and truly from a human.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Will Agents Find Trusted Partners?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=882s" rel="noopener noreferrer"&gt;14:42&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the short term, agents will use "decentralized registries of trust" (or allow lists) to find merchants they can interact with. Prateek noted that all the roles (merchant, credential provider, etc.) already exist in the payments industry today. The only new role is the &lt;strong&gt;Shopping Agent&lt;/strong&gt; itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accountability: What Happens When Things Go Wrong?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=963s" rel="noopener noreferrer"&gt;16:03&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the big question. What if your agent shows you &lt;em&gt;blue&lt;/em&gt; shoes, you wanted &lt;em&gt;teal&lt;/em&gt;, but you click "approve" anyway?&lt;/p&gt;

&lt;p&gt;Prateek explained that the signed &lt;strong&gt;Cart Mandate&lt;/strong&gt; solves this. Because you biometrically signed a tamper-proof credential showing the &lt;em&gt;blue&lt;/em&gt; shoes, the responsibility is on you. The merchant has cryptographic evidence that you saw and approved the exact product. This protects merchants from fraudulent chargebacks and users from unauthorized agent actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo: Reference Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1084s" rel="noopener noreferrer"&gt;18:04&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Prateek walked through a demo showing the human-present flow. It showed the user prompting the agent, the agent discovering products, and then the &lt;strong&gt;Credential Provider (PayPal)&lt;/strong&gt; getting involved. The user selected their shipping and payment info &lt;em&gt;from PayPal&lt;/em&gt;, and the agent only saw a reference. The user then signed the Cart Mandate, and the purchase was completed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compatibility and Getting Started
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1183s" rel="noopener noreferrer"&gt;19:43&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A key question was: is this compatible with frameworks like LangGraph or CrewAI? &lt;strong&gt;Yes&lt;/strong&gt;. Prateek confirmed the protocol is compatible with any framework. As long as your agent can communicate over A2A or MCP, you can use AP2.&lt;/p&gt;

&lt;p&gt;To get started, Prateek directed developers to the &lt;a href="https://github.com/google-agentic-commerce/AP2" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. The first step is to see which role you want to play (merchant, credentials provider, etc.) and explore the sample code for that role.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Future: Dynamic Negotiation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1273s" rel="noopener noreferrer"&gt;21:13&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Looking ahead, Prateek shared an exciting vision for "dynamic negotiation." Imagine telling your agent: "I want that red dress that's out of stock. I need it by tomorrow... and I'm willing to pay 30% more".&lt;/p&gt;

&lt;p&gt;A merchant's agent could see this "intent" and, if the dress becomes available, automatically complete the sale. What was a lost sale for the merchant becomes a completed order at a markup, and the user gets the exact item they desperately wanted. &lt;/p&gt;

&lt;h2&gt;
  
  
  Your turn to build
&lt;/h2&gt;

&lt;p&gt;This conversation made it clear that building a secure payment infrastructure is a foundational step toward creating agents that can perform truly useful tasks in the real world. We're moving from a simple, programmatic web to a conversational, contractual one, and this protocol provides the framework for it.&lt;/p&gt;

&lt;p&gt;We encourage you to check out the &lt;a href="https://github.com/google-agentic-commerce/AP2" rel="noopener noreferrer"&gt;Agent Payment Protocol GitHub repo&lt;/a&gt;, think about which role you could play in this new ecosystem, and start building today!&lt;/p&gt;

&lt;h4&gt;
  
  
  Connect with us
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Shir Meir Lador → &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ivan Nardini → &lt;a href="https://www.linkedin.com/in/ivan-nardini/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/ivnardini" rel="noopener noreferrer"&gt;X&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Prateek Dudeja → &lt;a href="https://www.linkedin.com/in/prateek-dudeja/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>security</category>
      <category>ai</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>The Agent Factory podcast: 5 Episodes to Kickstart Your Journey to Production AI</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Tue, 25 Nov 2025 21:22:16 +0000</pubDate>
      <link>https://dev.to/googleai/the-agent-factory-podcast-5-episodes-to-kickstart-your-journey-to-production-ai-35ml</link>
      <guid>https://dev.to/googleai/the-agent-factory-podcast-5-episodes-to-kickstart-your-journey-to-production-ai-35ml</guid>
      <description>&lt;p&gt;We are so proud to announce that a project we're incredibly passionate about has grown into a full-blown resource for developers: The Agent Factory video podcast.&lt;/p&gt;

&lt;p&gt;We started this show with a simple mission: to have the conversations developers need to be having about AI agent development. We wanted to move past the hype and focus on what really matters—building production-ready AI agents.&lt;/p&gt;

&lt;p&gt;Fast forward to today, and we have &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs" rel="noopener noreferrer"&gt;14 episodes&lt;/a&gt; published, covering everything from architecture patterns to end-to-end vibe coding of advanced AI applications. To celebrate, we’re sharing our first 5 foundational episodes with the Dev.to community. If you are just starting to build agents or looking to harden your existing systems, this is the perfect place to start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Expect:&lt;/strong&gt;&lt;br&gt;
We pack every episode with three core segments designed for developers:&lt;/p&gt;

&lt;p&gt;🎙️ &lt;strong&gt;Agent Industry Pulse:&lt;/strong&gt; We filter the noise and bring you the latest news you actually need to know.&lt;/p&gt;

&lt;p&gt;🛠️ &lt;strong&gt;The Factory Floor:&lt;/strong&gt; A technical deep-dive where we get our hands dirty with code, architectures, and patterns.&lt;/p&gt;

&lt;p&gt;❓ &lt;strong&gt;Developer Q&amp;amp;A:&lt;/strong&gt; We answer real questions from the community so we can all learn together.&lt;/p&gt;

&lt;p&gt;📺 &lt;strong&gt;The Starter Pack: Our First 5 Episodes&lt;/strong&gt;&lt;br&gt;
Here is the chronological journey to get you up to speed, starting from the very beginning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agents, their frameworks and when to use them (ft. Julia Wiesinger)&lt;/strong&gt; &lt;br&gt;
We kicked things off by tackling the big questions: What exactly is an agent? How do you choose between frameworks like LangChain, CrewAI, or the Agent Development Kit (ADK)? We were joined by Julia Wiesinger from the ADK team to guide us through building for production. 

  &lt;iframe src="https://www.youtube.com/embed/aLYrV61rJG4"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-Agent Systems: Concepts &amp;amp; Patterns&lt;/strong&gt;&lt;br&gt;
Single agent or multi-agent? In this episode, we break down the architectural patterns that matter, from Supervisors to Swarms. We discuss exactly when you should transition from a single agent to a team of agents to handle complexity and improve reliability. 

  &lt;iframe src="https://www.youtube.com/embed/TGNScswE0kU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Building Custom Tools for Agents&lt;/strong&gt;&lt;br&gt;
Agents are only as good as the tools they can use. We dive into Model Context Protocol (MCP), function calling, and how to build secure, authenticated tools that let your agents interact with the real world safely. 

  &lt;iframe src="https://www.youtube.com/embed/NiLb5DK4_rU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Memory in Agents (ft. Kimberly Milam)&lt;/strong&gt;&lt;br&gt;
How do you stop your agent from acting like a goldfish? We chat with Kimberly Milam about implementing long-term memory, managing state, and the "Memory Bank" concept to create personalized experiences that persist across sessions. 

  &lt;iframe src="https://www.youtube.com/embed/2yW7aTfjo88"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Tackling the Hardest Questions (ft. Philipp Schmidt)&lt;/strong&gt;&lt;br&gt;
We sat down with Philipp Schmidt from Google DeepMind for a masterclass on the agent development workflow. We cover context engineering, evaluation strategies, and pro-tips for using the Gemini CLI to speed up your development cycle. 

  &lt;iframe src="https://www.youtube.com/embed/kPVZQ3ae7-8"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💬 Join the Conversation&lt;/strong&gt;&lt;br&gt;
We’re truly excited to continue building this community with you. Whether you're stuck on a specific bug or wondering about a new architecture, we want to hear from you.&lt;/p&gt;

&lt;p&gt;What are you struggling with right now? Drop your questions in the comments below with &lt;strong&gt;#TheAgentFactory&lt;/strong&gt;, and we might answer them in our next Q&amp;amp;A segment!&lt;/p&gt;

&lt;p&gt;➡️ &lt;strong&gt;Listen &amp;amp; Subscribe: &lt;a href="https://www.youtube.com/googlecloudplatform" rel="noopener noreferrer"&gt;Google Cloud Tech&lt;/a&gt; on YouTube&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
