<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shir Meir Lador</title>
    <description>The latest articles on DEV Community by Shir Meir Lador (@shirmeirlador).</description>
    <link>https://dev.to/shirmeirlador</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3596246%2F7fba9a43-cbe3-4af2-adff-1871187ffbf8.jpeg</url>
      <title>DEV Community: Shir Meir Lador</title>
      <link>https://dev.to/shirmeirlador</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shirmeirlador"/>
    <language>en</language>
    <item>
      <title>Fine-Tuning Gemma 3 with Cloud Run Jobs: Serverless GPUs (NVIDIA RTX 6000 Pro) for pet breed classification 🐈🐕</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:07:00 +0000</pubDate>
      <link>https://dev.to/googleai/fine-tuning-gemma-3-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-248b</link>
      <guid>https://dev.to/googleai/fine-tuning-gemma-3-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-248b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr33mdn056bnbis88u9kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr33mdn056bnbis88u9kj.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Architectural workflow: fine-tuning Gemma 3 27B on Cloud Run Jobs&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Recently, I was inspired by a major new release on Google Cloud: the availability of &lt;strong&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads?utm_campaign=CDR_0x91b1edb5_default_b488149523&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs&lt;/a&gt;&lt;/strong&gt; on &lt;a href="https://docs.cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run Jobs&lt;/a&gt;. This launch matters because it lets you tackle fine-tuning workloads for open models with the simplicity of a serverless batch job. To put the new hardware to the test in a fun way, I fine-tuned a multimodal model to identify a pet’s breed from a photo using &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/pets/" rel="noopener noreferrer"&gt;The Oxford-IIIT Pet Dataset&lt;/a&gt;. Such a model could power a “smart pet care” application: an AI assistant that identifies a pet’s breed and provides tailored health and nutrition advice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p12qsqvyysokppob26f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p12qsqvyysokppob26f.png" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Images taken from &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/pets/" rel="noopener noreferrer"&gt;The Oxford-IIIT Pet Dataset&lt;/a&gt;, showing cats and dogs alongside their corresponding breed, which serves as the classification label&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Why Fine-Tuning?
&lt;/h3&gt;

&lt;p&gt;In a recent &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4" rel="noopener noreferrer"&gt;Agent Factory episode&lt;/a&gt;, we discussed that while foundational models are a powerful ‘one-size-fits-all’ starting point, they essentially remain generalists. You should consider fine-tuning when you have a problem that requires &lt;strong&gt;high specialization&lt;/strong&gt; that a generalist model might not excel in on its own, or when you need more &lt;strong&gt;control&lt;/strong&gt; and &lt;strong&gt;cost-efficiency&lt;/strong&gt; over your own hosting.&lt;/p&gt;

&lt;p&gt;For this pet-care use case, distinguishing between 37 different breeds isn’t just about ‘knowledge’, it’s about taking that foundational reasoning and adding a specific capability based on a unique dataset. As we explored in the episode and as mentioned in this &lt;a href="https://arxiv.org/pdf/2506.02153" rel="noopener noreferrer"&gt;Nvidia paper&lt;/a&gt;, this kind of specialization is what allows smaller, focused models to become &lt;strong&gt;sufficiently powerful&lt;/strong&gt; and &lt;strong&gt;economical&lt;/strong&gt; for production agentic systems. Fine-tuning acts as the necessary bridge, transforming a broad reasoner into a high-precision classification expert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bridging Reasoning and Precision
&lt;/h3&gt;

&lt;p&gt;For this project, I chose the multimodal breadth of &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B&lt;/a&gt;. While specialized vision models often provide superior accuracy for narrow identification tasks, I wanted to use a model capable of both identifying breeds and reasoning about the specific health and dietary needs associated with them. By leveraging the power of the new &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads?e=48754805" rel="noopener noreferrer"&gt;Blackwell GPUs&lt;/a&gt;, I was able to fine-tune this model to bridge the performance gap, all while keeping the setup &lt;strong&gt;reproducible, cost-effective,&lt;/strong&gt; and entirely &lt;strong&gt;container-native.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  From Batch to Production: Economically Efficient Hosting
&lt;/h3&gt;

&lt;p&gt;The true ‘deploy and forget’ magic happens after the weights are saved. With high-performance inference &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads?e=48754805&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b488149523&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;now supported&lt;/a&gt; on Cloud Run, you can host your fine-tuned Gemma 3 27B model on the same NVIDIA RTX PRO 6000 Blackwell GPU without managing any underlying infrastructure. This setup delivers a highly economical production environment: Cloud Run automatically &lt;strong&gt;scales your GPU instances to zero&lt;/strong&gt; when they aren’t in use, ensuring you only pay for the exact minutes your model is active.&lt;/p&gt;

&lt;p&gt;In this guide, I’m excited to show you how this new hardware release transforms complex fine-tuning into a scalable, serverless experience without the need to manage complex clusters or maintain idle instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simplifying 27B Fine-Tuning on Cloud Run
&lt;/h2&gt;

&lt;p&gt;Fine-tuning an open model can seem like a daunting task that requires complex orchestration, from provisioning high-capacity VMs and manually installing CUDA drivers to managing tedious data transfers and scaling down manually to control costs. &lt;a href="https://docs.cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run Jobs&lt;/a&gt; elegantly solves this by allowing you to package your training logic as a container, now backed by the fully managed environment of &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads" rel="noopener noreferrer"&gt;&lt;strong&gt;NVIDIA RTX PRO 6000 Blackwell GPUs&lt;/strong&gt;&lt;/a&gt; and their 96GB of VRAM.&lt;/p&gt;

&lt;p&gt;This setup delivers on-demand availability without the need for reservations, rapid 5-second startup times with drivers pre-installed, and automatic scale-to-zero efficiency that ensures you only pay for the minutes your model is training. By leveraging built-in GCS volume mounting for high-speed access to model weights, we can now move past infrastructure hurdles and focus on the core task: fine-tuning Gemma 3 27B to achieve high-precision results for &lt;strong&gt;Pet Breed Classification&lt;/strong&gt; on the &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/pets/" rel="noopener noreferrer"&gt;Oxford-IIIT Pet Dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’d like to dive straight into the code, you can clone the repository &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/finetune_gemma" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin the fine-tuning process, ensure you have the following software and environment configurations in place.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.12+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.astral.sh/uv/getting-started/installation/#standalone-installer" rel="noopener noreferrer"&gt;&lt;strong&gt;uv&lt;/strong&gt;&lt;/a&gt; (Python package manager): will be used to manage our local Python environment and speed up our Docker builds. Use curl to download the script and execute it with sh:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/sdk/docs/install" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud SDK&lt;/strong&gt;&lt;/a&gt; (gcloud CLI) installed and authenticated.&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://docs.cloud.google.com/resource-manager/docs/creating-managing-projects" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud Project&lt;/strong&gt;&lt;/a&gt; with billing enabled.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api" rel="noopener noreferrer"&gt;APIs Enabled&lt;/a&gt; Ensure the following APIs are active in your project: Cloud Run Admin API, Artifact Registry API, Cloud Build API, Secret Manager API, Compute Engine API (for GPU provisioning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/docs/hub/en/security-tokens" rel="noopener noreferrer"&gt;Hugging Face Token&lt;/a&gt;: A valid token with access to the &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B-IT&lt;/a&gt; model weights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Access to gated models:&lt;/strong&gt; &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B-IT&lt;/a&gt; is a gated model, which means you must explicitly accept the terms of use before you can download or fine-tune the weights.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accept the License:&lt;/strong&gt; Visit the &lt;a href="https://huggingface.co/google/gemma-3-27b-it" rel="noopener noreferrer"&gt;Gemma 3 27B-IT&lt;/a&gt; model page on Hugging Face and click the “Agree and access repository” button.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate a Token:&lt;/strong&gt; Once access is &lt;a href="https://huggingface.co/docs/hub/en/security-tokens" rel="noopener noreferrer"&gt;granted&lt;/a&gt;, ensure your Hugging Face Token has “read” permissions (or “write” if you plan to push your fine-tuned model back to the Hub) to authenticate your training job.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1 — Setting the stage: Your environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1.1 — Prepare your Google Cloud environment
&lt;/h3&gt;

&lt;p&gt;Set environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important: Regional alignment is critical.&lt;/strong&gt; To use Cloud Storage volume mounting, your GCS bucket &lt;strong&gt;must&lt;/strong&gt; be in the same region as your Cloud Run job. We recommend europe-west4 (Netherlands), which supports the RTX PRO 6000 Blackwell GPU and keeps access to your model weights low-latency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_PROJECT_ID
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;europe-west4
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_HF_TOKEN
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"finetune-gemma-job-sa"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="nt"&gt;-gemma3-finetuning-eu&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AR_REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma3-finetuning-repo
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SECRET_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;HF_TOKEN
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IMAGE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma3-finetune
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JOB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma3-finetuning-job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1.2 — Get the code
&lt;/h3&gt;

&lt;p&gt;Whether you’re running locally or in the cloud, you’ll need the code. After you open Cloud Shell or install the Google Cloud CLI locally, clone the repository. The finetune_gemma directory contains the finetune_and_evaluate.py script, a Dockerfile, and a requirements.txt file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/GoogleCloudPlatform/devrel-demos
&lt;span class="nb"&gt;cd &lt;/span&gt;devrel-demos/ai-ml/finetune_gemma/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log in to gcloud (this authorizes the CLI to run gcloud commands on your behalf):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your Project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the service account and grant storage permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create &lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Service Account for Gemma 3 fine-tuning"&lt;/span&gt;

gcloud storage buckets create gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;

gcloud storage buckets add-iam-policy-binding gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;serviceAccount:&lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt;@&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;roles/storage.objectAdmin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create an Artifact Registry repository and store your HF Token in Secret Manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud artifacts repositories create &lt;span class="nv"&gt;$AR_REPO&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--repository-format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Gemma 3 finetuning repository"&lt;/span&gt;

&lt;span class="c"&gt;# Create the secret (ignore error if it already exists)&lt;/span&gt;
gcloud secrets create &lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt; &lt;span class="nt"&gt;--replication-policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"automatic"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'%s'&lt;/span&gt; &lt;span class="s2"&gt;"$HF_TOKEN"&lt;/span&gt; | gcloud secrets versions add &lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt; &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-

gcloud secrets add-iam-policy-binding &lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt; serviceAccount:&lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt;@&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'roles/secretmanager.secretAccessor'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2 — Staging the Model with cr-infer (Recommended)
&lt;/h2&gt;

&lt;p&gt;To avoid downloading the model every time the job runs, we’ll stage the &lt;strong&gt;Gemma 3 27B&lt;/strong&gt; weights in Google Cloud Storage. We’ll use &lt;a href="https://github.com/oded996/cr-infer" rel="noopener noreferrer"&gt;&lt;strong&gt;cr-infer&lt;/strong&gt;&lt;/a&gt;, which allows you to run model transfers directly via uvx without needing a local installation.&lt;/p&gt;

&lt;p&gt;Before running the transfer, set up your Application Default Credentials. This is required for running scripts locally; it allows the cr-infer tool to use your local identity to write the weights to your GCS bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Download Gemma 3 27B to GCS&lt;/strong&gt;: Now, execute the transfer using uvx. This copies the model into gs://$BUCKET_NAME/google/gemma-3-27b-it/, allowing our Cloud Run job to mount the weights as a local volume and shaving a multi-gigabyte download off container startup time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx — from git+https://github.com/oded996/cr-infer.git cr-infer model download &lt;span class="se"&gt;\-&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;huggingface &lt;span class="se"&gt;\&lt;/span&gt;
 - model-id google/gemma-3–27b-it &lt;span class="se"&gt;\&lt;/span&gt;
 - bucket &lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - token &lt;span class="nv"&gt;$HF_TOKEN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3 — Build and push the container image
&lt;/h2&gt;

&lt;p&gt;Our Dockerfile leverages &lt;strong&gt;uv&lt;/strong&gt; for fast dependency installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Use Google Cloud Build (Recommended — No local Docker needed)
&lt;/h3&gt;

&lt;p&gt;This is the easiest way to build your image directly in the cloud and push it to Artifact Registry. (The build typically takes &lt;strong&gt;10–15 minutes&lt;/strong&gt; as it downloads large ML dependencies like PyTorch).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud builds submit &lt;span class="nt"&gt;--tag&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;-docker.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; You can track the real-time progress of your build in the &lt;a href="https://console.cloud.google.com/cloud-build/builds" rel="noopener noreferrer"&gt;Cloud Build console&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option B: Build locally with Docker
&lt;/h3&gt;

&lt;p&gt;If you have Docker Desktop installed locally:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install uv locally&lt;/strong&gt; (if you haven’t already):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Build the image:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Push to AR:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker tag &lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;
docker push &lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3.1 — Test locally (Optional)
&lt;/h3&gt;

&lt;p&gt;I like to start with a quick local test run to validate the setup. It serves as a sanity check for your environment and scripts before moving the workload to Cloud Run. For this test, we use parameters optimized for speed and a smaller model, google/gemma-3-4b-it, to ensure the model correctly learns the task format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 finetune_and_evaluate.py &lt;span class="se"&gt;\&lt;/span&gt;
- model-id google/gemma-3–4b-it &lt;span class="se"&gt;\&lt;/span&gt;
 - train-size 20 &lt;span class="se"&gt;\&lt;/span&gt;
 - eval-size 20 &lt;span class="se"&gt;\&lt;/span&gt;
 - gradient-accumulation-steps 2 &lt;span class="se"&gt;\&lt;/span&gt;
 - learning-rate 2e-4 &lt;span class="se"&gt;\&lt;/span&gt;
 - batch-size 1 &lt;span class="se"&gt;\&lt;/span&gt;
 - num-epochs 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my Apple M4 Pro, running this on the CPU took about &lt;strong&gt;20–30 minutes.&lt;/strong&gt; If you want to see early signs of progress locally, you can increase the sample size — I found that a one-hour run on my Mac with 50 training and testing samples already yielded a 4% improvement in accuracy and a 3% boost in F1-score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmlpzzou35x4bwnh8wiv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmlpzzou35x4bwnh8wiv.png" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Results from a local run on my Mac with 50 train and 50 test samples&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Inside the Fine-Tuning Script: How it Works
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/blob/main/ai-ml/finetune_gemma/finetune_and_evaluate.py" rel="noopener noreferrer"&gt;finetune_and_evaluate.py&lt;/a&gt; script is designed to be a complete, self-contained pipeline, handling everything from data preparation to hardware-aware optimization and evaluation. Here is a look at the core logic that makes this possible:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Memory-Efficient Model Loading
&lt;/h3&gt;

&lt;p&gt;To fit a 27B parameter model into the 96GB VRAM of the Blackwell GPU, the script uses 4-bit quantization via the &lt;a href="https://github.com/bitsandbytes-foundation/bitsandbytes" rel="noopener noreferrer"&gt;bitsandbytes&lt;/a&gt; library. By setting low_cpu_mem_usage=True, it also ensures the model is loaded efficiently without exhausting the system RAM.&lt;/p&gt;
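As a rough back-of-the-envelope check (my own arithmetic, not taken from the script), you can see why 4-bit weights are what make a 27B-parameter model comfortable in 96GB of VRAM:

```python
# Rough VRAM footprint of Gemma 3 27B weights at different precisions.
# Illustrative arithmetic only; real usage adds activations, gradients,
# optimizer state, and quantization overhead on top of the raw weights.
PARAMS = 27e9
BYTES_PER_GIB = 1024 ** 3

def weight_gib(bits_per_param):
    """Size of the model weights alone, in GiB, at the given precision."""
    return PARAMS * bits_per_param / 8 / BYTES_PER_GIB

fp16_gib = weight_gib(16)   # full bf16/fp16 weights
int4_gib = weight_gib(4)    # 4-bit quantized weights

print(f"bf16 weights: ~{fp16_gib:.0f} GiB")
print(f"4-bit weights: ~{int4_gib:.0f} GiB")
# Roughly 50 GiB in bf16 versus roughly 13 GiB in 4-bit, leaving
# headroom in 96 GB of VRAM for LoRA adapters and activations.
```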

&lt;h3&gt;
  
  
  2. Vision-Language LoRA Configuration
&lt;/h3&gt;

&lt;p&gt;Instead of updating all 27 billion parameters, we use LoRA (Low-Rank Adaptation). We target all the primary projection layers in the transformer blocks, allowing the model to adapt its internal representations to the visual nuances of the pet breeds while keeping the total trainable parameter count extremely low. More details on efficient GPU memory usage can be found in this &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models/?e=48754805" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/p&gt;
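To make "extremely low trainable parameter count" concrete, here is a quick calculation (my own illustration; the layer width and rank below are hypothetical, not Gemma 3's actual configuration):

```python
# LoRA replaces a full d_out x d_in weight update with two low-rank
# factors A (r x d_in) and B (d_out x r), so the trainable count per
# adapted layer drops from d_out * d_in to r * (d_in + d_out).
def full_params(d_in, d_out):
    """Trainable parameters for a full-rank update of one layer."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA update of the same layer."""
    return rank * (d_in + d_out)

# Hypothetical square projection layer, for illustration only.
d_in = d_out = 4096
rank = 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full update: {full:,} params, LoRA: {lora:,} params "
      f"({100 * lora / full:.2f}% of full)")
```

At rank 16 on a 4096-wide layer, the adapter trains under 1% of the parameters a full update would touch, which is what keeps the fine-tune within a single GPU's budget.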

&lt;h3&gt;
  
  
  3. The Custom Data Collator
&lt;/h3&gt;

&lt;p&gt;This is a crucial part for fine-tuning vision-language models (VLMs). Because VLMs process a mix of image and text tokens, the data_collator ensures that the model only learns from the breed label (the model’s response). The &lt;em&gt;turn marker&lt;/em&gt; is a structural boundary that signals the exact point where the user stops speaking and the model’s response begins. The script ensures the model learns only from the breed label by searching for the model’s &lt;em&gt;turn marker&lt;/em&gt; in the token sequence and masking out the user’s prompt and image tokens, so they don’t contribute to the training loss.&lt;/p&gt;
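The masking idea can be sketched in plain Python (a simplified stand-in for the script's actual collator; the token ids and the turn-marker sequence below are made up):

```python
# Sketch of response-only loss masking for a VLM collator: everything
# up to and including the model's turn marker gets the ignore index
# -100, so only the breed-label tokens contribute to the training loss.
IGNORE_INDEX = -100

def mask_prompt(input_ids, turn_marker):
    """Return labels where only tokens after the turn marker are kept."""
    labels = list(input_ids)
    n, m = len(input_ids), len(turn_marker)
    for start in range(n - m + 1):
        if input_ids[start:start + m] == turn_marker:
            # Mask the user prompt, image tokens, and the marker itself.
            for i in range(start + m):
                labels[i] = IGNORE_INDEX
            return labels
    # No marker found: mask everything so the example adds no loss.
    return [IGNORE_INDEX] * n

# Toy example: [image/prompt tokens] + turn marker [7, 8] + label tokens.
ids = [101, 5, 6, 7, 8, 42, 43]
print(mask_prompt(ids, [7, 8]))
# Only the label tokens 42 and 43 survive; earlier positions are -100.
```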

&lt;h3&gt;
  
  
  4. Breed Extraction
&lt;/h3&gt;

&lt;p&gt;Generative models often add conversational filler (e.g., “The animal in this image is a Samoyed”). Our evaluation logic includes a robust extraction heuristic that sorts class names by length. This ensures that if the model mentions “English Cocker Spaniel,” it correctly identifies the full breed rather than just matching “Cocker Spaniel”.&lt;/p&gt;
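The longest-match heuristic is simple enough to sketch (my own minimal reimplementation of the idea described above, not the repository's exact code):

```python
# Extract a breed name from free-form model output by trying the
# longest class names first, so "English Cocker Spaniel" wins over the
# shorter "Cocker Spaniel" when both appear in the generated text.
CLASSES = ["Cocker Spaniel", "English Cocker Spaniel", "Samoyed"]

def extract_breed(text, classes=CLASSES):
    """Return the longest class name found in text, or None."""
    for name in sorted(classes, key=len, reverse=True):
        if name.lower() in text.lower():
            return name
    return None

print(extract_breed("I think this is an English Cocker Spaniel!"))
# Matching longest-first avoids truncating compound breed names.
```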

&lt;h3&gt;
  
  
  5. Automated GCS Archiving
&lt;/h3&gt;

&lt;p&gt;Once the training completes and the final evaluation is calculated, the script doesn’t just stop. It bundles the fine-tuned LoRA adapters with the original model processor and automatically uploads the entire directory to your Google Cloud Storage bucket. This ensures your model is immediately ready for deployment or serving.&lt;/p&gt;
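The path mapping behind that archiving step can be sketched as follows (a simplified illustration; plan_uploads is a hypothetical helper, and the actual upload client call is out of scope here):

```python
import os
import posixpath

def plan_uploads(local_dir, gcs_prefix):
    """Map each file under local_dir to a destination URI under gcs_prefix."""
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            # GCS object names always use forward slashes.
            dest = posixpath.join(gcs_prefix, rel.replace(os.sep, "/"))
            pairs.append((local_path, dest))
    return pairs

# E.g. plan_uploads("/tmp/gemma3-finetuned", "gs://BUCKET/gemma3-finetuned")
# pairs every adapter and processor file with its bucket destination,
# ready to hand to an upload loop.
```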

&lt;h2&gt;
  
  
  Step 4 — Create and execute the Cloud Run job
&lt;/h2&gt;

&lt;p&gt;Now, we harness the power of the &lt;strong&gt;NVIDIA RTX PRO 6000 Blackwell GPU.&lt;/strong&gt; Our container is built with &lt;strong&gt;CUDA 12.8&lt;/strong&gt; for full Blackwell/PyTorch 2.7 compatibility and uses an ENTRYPOINT configuration, allowing you to pass script arguments directly via the --args flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If the job already exists, use gcloud beta run jobs update instead of create.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud beta run &lt;span class="nb"&gt;jobs &lt;/span&gt;create &lt;span class="nv"&gt;$JOB_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - region &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - image &lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;:latest &lt;span class="se"&gt;\&lt;/span&gt;
 - set-env-vars &lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - set-secrets &lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$SECRET_ID&lt;/span&gt;:latest &lt;span class="se"&gt;\&lt;/span&gt;
 - no-gpu-zonal-redundancy &lt;span class="se"&gt;\&lt;/span&gt;
 - cpu 20.0 &lt;span class="se"&gt;\&lt;/span&gt;
 - memory 80Gi &lt;span class="se"&gt;\&lt;/span&gt;
 - task-timeout 60m &lt;span class="se"&gt;\&lt;/span&gt;
 - gpu 1 &lt;span class="se"&gt;\&lt;/span&gt;
 - gpu-type nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
 - service-account &lt;span class="nv"&gt;$SERVICE_ACCOUNT&lt;/span&gt;@&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;.iam.gserviceaccount.com &lt;span class="se"&gt;\&lt;/span&gt;
 - add-volume &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model-volume,type&lt;span class="o"&gt;=&lt;/span&gt;cloud-storage,bucket&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 - add-volume-mount &lt;span class="nv"&gt;volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model-volume,mount-path&lt;span class="o"&gt;=&lt;/span&gt;/mnt/gcs &lt;span class="se"&gt;\&lt;/span&gt;
 - &lt;span class="nv"&gt;network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
 - &lt;span class="nv"&gt;subnet&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;default &lt;span class="se"&gt;\&lt;/span&gt;
 - vpc-egress&lt;span class="o"&gt;=&lt;/span&gt;private-ranges-only &lt;span class="se"&gt;\&lt;/span&gt;
 - &lt;span class="nv"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;" - model-id"&lt;/span&gt;,&lt;span class="s2"&gt;"/mnt/gcs/google/gemma-3–27b-it/"&lt;/span&gt;,&lt;span class="s2"&gt;" - output-dir"&lt;/span&gt;,&lt;span class="s2"&gt;"/tmp/gemma3-finetuned"&lt;/span&gt;,&lt;span class="s2"&gt;" - gcs-output-path"&lt;/span&gt;,&lt;span class="s2"&gt;"gs://&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;&lt;span class="s2"&gt;/gemma3-finetuned"&lt;/span&gt;,&lt;span class="s2"&gt;" - train-size"&lt;/span&gt;,&lt;span class="s2"&gt;"800"&lt;/span&gt;,&lt;span class="s2"&gt;" - eval-size"&lt;/span&gt;,&lt;span class="s2"&gt;"200"&lt;/span&gt;,&lt;span class="s2"&gt;" - learning-rate"&lt;/span&gt;,&lt;span class="s2"&gt;"5e-5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note on Execution Limits:&lt;/strong&gt; Tasks using GPUs on Cloud Run Jobs currently have a maximum execution time of &lt;strong&gt;60 minutes&lt;/strong&gt;. To ensure this training job completes within the standard public limit, we set --num-epochs to 3 and restricted --train-size to 800 samples. If your fine-tuning workload needs more time, you can split the training dataset into segments that each fit under 60 minutes (800 samples in our case) and process them as a sequence of independent tasks, checkpointing the model between tasks.&lt;/p&gt;
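That segmentation strategy can be sketched as follows (illustrative only; the resume_from values and the checkpoint path are hypothetical stand-ins for your own checkpointing logic):

```python
# Split a training set into segments that each fit the 60-minute GPU
# task limit, then plan a sequence of jobs where each task resumes from
# the previous segment's checkpoint.
def make_segments(num_samples, segment_size=800):
    """Return (start, end) index ranges covering the dataset."""
    return [(s, min(s + segment_size, num_samples))
            for s in range(0, num_samples, segment_size)]

def plan_jobs(num_samples, segment_size=800):
    """Pair each segment with the checkpoint it should resume from."""
    plan = []
    checkpoint = None  # the first task starts from the base model
    for i, (start, end) in enumerate(make_segments(num_samples, segment_size)):
        plan.append({"task": i, "range": (start, end),
                     "resume_from": checkpoint})
        # Hypothetical checkpoint location written by task i.
        checkpoint = f"gs://BUCKET/checkpoints/segment-{i}"
    return plan

for job in plan_jobs(2000):
    print(job)
```

Each entry in the plan could then drive one `gcloud run jobs execute` invocation, with the range and checkpoint passed through `--args`.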

&lt;h3&gt;
  
  
  Understanding the Deployment Flags
&lt;/h3&gt;

&lt;p&gt;To ensure a stable and production-ready environment, we use several specialized flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;--gpu-type nvidia-rtx-pro-6000:&lt;/strong&gt; Targets the NVIDIA RTX PRO 6000 Blackwell GPU. With &lt;strong&gt;96GB of GPU memory (VRAM), 1.6 TB/s bandwidth,&lt;/strong&gt; and support for &lt;strong&gt;FP4/FP6 precision,&lt;/strong&gt; it provides the ample overhead and high-speed throughput needed for multimodal fine-tuning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--memory 80Gi:&lt;/strong&gt; We allocate high system RAM (scalable up to 176GB) to handle low_cpu_mem_usage model loading and our memory-efficient streaming data generator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--cpu 20.0:&lt;/strong&gt; Cloud Run Jobs allows scaling up to &lt;strong&gt;44 vCPUs&lt;/strong&gt; per instance, ensuring that preprocessing and data loading never become a bottleneck for the GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--add-volume &amp;amp; --add-volume-mount:&lt;/strong&gt; Mounts your GCS bucket as a local directory at /mnt/gcs. &lt;strong&gt;Note:&lt;/strong&gt; This requires the bucket and the job to be in the same region (europe-west4). It allows the script to read the base model weights at data-center speeds without copying them into the container’s writable layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--network &amp;amp; --subnet:&lt;/strong&gt; Configures &lt;strong&gt;Direct VPC Egress&lt;/strong&gt;, allowing the job to communicate securely with other resources in your VPC. For this to work, you need to enable &lt;a href="https://docs.cloud.google.com/vpc/docs/configure-private-google-access" rel="noopener noreferrer"&gt;“Private Google Access”&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--vpc-egress:&lt;/strong&gt; Controls which outgoing traffic is routed through your VPC. The command above uses private-ranges-only, which routes traffic to private IP ranges through your VPC; set it to all-traffic if you want every outbound request, including requests to Hugging Face, routed through your VPC for enhanced security and monitoring.&lt;/li&gt;
&lt;/ul&gt;
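&lt;p&gt;To make the “memory-efficient streaming data generator” mentioned above concrete, here is a minimal illustrative sketch (the field names and prompt are hypothetical, not the actual training script): it yields one chat-formatted example at a time instead of materializing the whole dataset in RAM.&lt;/p&gt;

```python
# Illustrative sketch of a memory-efficient streaming generator: yields one
# formatted training example at a time rather than building the full list in
# memory. The "breed" field and the prompt text are hypothetical placeholders.
from typing import Iterator

def stream_examples(samples: list, limit: int) -> Iterator[dict]:
    """Lazily yield up to `limit` chat-formatted examples."""
    for i, sample in enumerate(samples):
        if i >= limit:
            return
        yield {
            "messages": [
                {"role": "user", "content": "What breed is shown in this image?"},
                {"role": "assistant", "content": sample["breed"]},
            ]
        }
```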

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If you skipped Step 2 and didn’t stage the model in your GCS bucket, change the --model-id value in --args to google/gemma-3-27b-it. This tells the script to download the weights directly from Hugging Face at runtime, though this will be significantly slower than using the GCS mount.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execute the job:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud beta run &lt;span class="nb"&gt;jobs &lt;/span&gt;execute &lt;span class="nv"&gt;$JOB_NAME&lt;/span&gt; — region &lt;span class="nv"&gt;$REGION&lt;/span&gt; — async
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5 — Check Results and Evaluate Performance
&lt;/h2&gt;

&lt;p&gt;Once your job finishes, you can jump into the Google Cloud Console to inspect the detailed logs. You’ll find your newly fine-tuned model waiting for you in your Cloud Storage bucket at gs://$BUCKET_NAME/gemma3-finetuned.&lt;/p&gt;

&lt;p&gt;To rigorously quantify how well Gemma 3 learned to identify these breeds, we used Accuracy and Macro F1 Score as our primary metrics. While accuracy gives us a clear overall percentage, the F1 score ensures the model is accurate across all 37 breeds, not just the most common ones.&lt;/p&gt;
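&lt;p&gt;For readers who want the metrics spelled out, here is a minimal dependency-free sketch of both: accuracy is the overall fraction of correct predictions, while macro F1 computes an F1 score per breed and averages them, so rare breeds weigh as much as common ones.&lt;/p&gt;

```python
# Minimal sketch of accuracy and macro F1 in pure Python (illustration only;
# in practice a library such as scikit-learn computes the same quantities).
from collections import defaultdict

def accuracy_and_macro_f1(y_true: list, y_pred: list):
    # Accuracy: fraction of predictions that match the true label.
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Per-class true positives, false positives, false negatives.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Macro F1: unweighted mean of per-class F1 scores.
    f1s = []
    for cls in set(y_true) | set(y_pred):
        prec = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        rec = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)
```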

&lt;p&gt;In my testing, I saw a clear progression as we scaled our data and compute:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv9ukl6ye7kuva89099k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv9ukl6ye7kuva89099k.png" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Results with different sample sizes&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;79% Accuracy, 77% F1-score (1.1h run):&lt;/strong&gt; Trained on 1,000 samples and evaluated against 200 test samples, this was a significant jump from the zero-shot baseline of 66%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;93% Accuracy, 91% F1-score (2.3h run):&lt;/strong&gt; By scaling up to 2,500 training samples (and 1,500 test samples), the model reached nearly state-of-the-art performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;94% Accuracy, 91.5% F1-score (3.3h run):&lt;/strong&gt; With a larger run on 3,600 training samples (evaluated against 3,500 test samples), the model effectively hit the state-of-the-art benchmark for this dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzldj4ngizd6okblrtry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzldj4ngizd6okblrtry.png" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Performance summary report for 3,600 train samples and 3,500 test samples: reached state of the art with &lt;strong&gt;94% accuracy!&lt;/strong&gt;&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;It is important to note that the standard &lt;strong&gt;public limit&lt;/strong&gt; for GPU jobs is currently 60 minutes. As mentioned in step 4, sampling and &lt;a href="https://huggingface.co/docs/trl/sft_trainer#trl.SFTTrainer.train.resume_from_checkpoint" rel="noopener noreferrer"&gt;checkpointing&lt;/a&gt; can help overcome this limitation.&lt;/p&gt;

&lt;p&gt;These results show that fine-tuning is the necessary bridge for generalist models: by leveraging serverless Blackwell GPUs, we’ve transformed a massive reasoner into a high-precision expert ready for production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps: Serving your fine-tuned model on Cloud Run
&lt;/h3&gt;

&lt;p&gt;Now that you’ve fine-tuned Gemma 3, the next challenge is serving it efficiently for production-grade inference.&lt;/p&gt;

&lt;p&gt;The true “deploy and forget” magic happens when you transition your saved weights into a serving environment. By hosting your fine-tuned model on Cloud Run with serverless Blackwell GPUs, you get a highly economical production environment where your GPU instances automatically scale to zero when they aren’t in use. This setup eliminates the operational toil of cluster management and manual maintenance, allowing you to serve massive models with no reservations; you only pay for the exact minutes your model is active.&lt;/p&gt;

&lt;p&gt;To get started with inference, explore this codelab: &lt;a href="https://codelabs.developers.google.com/codelabs/cloud-run/cloud-run-gpu-rtx-pro-6000" rel="noopener noreferrer"&gt;Run inference using a Gemma model on Cloud Run with RTX 6000 Pro GPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about production serving, refer to the official guide on &lt;a href="https://docs.cloud.google.com/run/docs/run-gemma-on-cloud-run" rel="noopener noreferrer"&gt;Running Gemma 3 on Cloud Run&lt;/a&gt;. The documentation provides a comprehensive roadmap for building a robust inference service, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Deployment:&lt;/strong&gt; Instructions for serving Gemma models using GPU accelerators and loading model weights via high-speed Cloud Storage volume mounts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure Interaction:&lt;/strong&gt; Guidance on using IAM authentication to securely call your deployed service with the Google Gen AI SDK.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Configuration:&lt;/strong&gt; Best practices for setting concurrency to achieve optimal request latency and high GPU utilization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Special thanks to Sara Ford and Oded Shahar from the Cloud Run team for the helpful review and feedback on this article.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nvidia</category>
      <category>ai</category>
      <category>gemma</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Agent Factory Recap: Supercharging Agents on GKE with Agent Sandbox and Pod Snapshots</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:04:00 +0000</pubDate>
      <link>https://dev.to/googleai/agent-factory-recap-supercharging-agents-on-gke-with-agent-sandbox-and-pod-snapshots-3a5e</link>
      <guid>https://dev.to/googleai/agent-factory-recap-supercharging-agents-on-gke-with-agent-sandbox-and-pod-snapshots-3a5e</guid>
      <description>&lt;p&gt;In the latest episode of the &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs" rel="noopener noreferrer"&gt;Agent Factory&lt;/a&gt;, Mofi Rahman and I had the pleasure of hosting, Brandon Royal, the PM working on agentic workloads on GKE. We dove deep into the critical questions around the nuances of choosing the right agent runtime, the power of GKE for agents, and the essential security measures needed for intelligent agents to run code.&lt;/p&gt;

&lt;p&gt;This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GKE for Agents?
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=109s" rel="noopener noreferrer"&gt;01:49&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We kicked off our discussion by tackling a fundamental question: why choose GKE as your agent runtime when serverless options like Cloud Run or fully managed solutions like Agent Engine exist?&lt;/p&gt;

&lt;p&gt;Brandon explained that the decision often boils down to control versus convenience. While serverless options are perfectly adequate for basic agents, the flexibility and governance capabilities of Kubernetes and GKE become indispensable in high-scale scenarios involving hundreds or thousands of agents. GKE truly shines when you need granular control over your agent deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08gkxy41hseuy3fljpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl08gkxy41hseuy3fljpu.png" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ADK on GKE
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=418s" rel="noopener noreferrer"&gt;06:58&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've discussed the &lt;a href="https://www.youtube.com/watch?v=aLYrV61rJG4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=17" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt; in previous episodes, and Mofi highlighted how seamlessly it integrates with GKE, even showing a demo with an agent he built. ADK provides the framework for building the agent's logic, traces, and tools, while GKE provides the robust hosting environment. You can containerize your ADK agent, push it to Google Artifact Registry, and deploy it to GKE in minutes, transforming a local prototype into a globally accessible service.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sandbox problem
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=920s" rel="noopener noreferrer"&gt;15:20&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As agents become more sophisticated and capable of writing and executing code, a critical security concern emerges: the risk of untrusted, LLM-generated code. Brandon emphasized that while code execution is vital for high-performance agents and deterministic behavior, it also introduces significant risks in multi-tenant systems. This led us to the concept of a "sandbox."&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Sandbox?
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1158s" rel="noopener noreferrer"&gt;19:18&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those less familiar with security engineering, Brandon clarified that a sandbox provides kernel and network isolation. Mofi further elaborated, explaining that agents often need to execute scripts (e.g., Python for data analysis). Without a sandbox, a hallucinating or prompt-injected model could potentially delete databases or steal secrets if allowed to run code directly on the main server. A sandbox creates a safe, isolated environment where such code can run without harming other systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Sandbox on GKE Demo
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1225s" rel="noopener noreferrer"&gt;20:25&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, how do we build this "high fence" on Kubernetes? Brandon introduced the Agent Sandbox on Kubernetes, which leverages technologies like gVisor, an application kernel sandbox. When an agent needs to execute code, GKE dynamically provisions a completely isolated pod. This pod operates with its own kernel, network, and file system, effectively trapping any malicious code within the gVisor bubble. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexw6cndzjl0w1ybb8mz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexw6cndzjl0w1ybb8mz1.png" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mofi walked us through a compelling demo of the Agent Sandbox in action. We observed an ADK agent being given a task requiring code execution. As the agent initiated code execution, GKE dynamically provisioned a new pod, visibly labeled as "sandbox-executor," demonstrating the real-time isolation. Brandon highlighted that this pod is configured with strict network policies, further enhancing security.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feauxfwh9kazbqc32u7kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feauxfwh9kazbqc32u7kz.png" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: Pod Snapshots
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=5_R_Ixk8ENQ&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1779s" rel="noopener noreferrer"&gt;29:39&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the Agent Sandbox offers incredible security, the latency of spinning up a new pod for every task is a concern. Mofi demoed the game-changing solution: Pod Snapshots. This technology allows us to save the state of running sandboxes and then near-instantly restore them when an agent needs them. Brandon noted that this reduces startup times from minutes to seconds, revolutionizing real-time agentic workflows on GKE.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cfc4k9zczexdby59o0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cfc4k9zczexdby59o0z.png" width="800" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It's incredible to see how GKE isn't just hosting agents; it's actively protecting them and making them faster. &lt;/p&gt;

&lt;h2&gt;
  
  
  Your turn to build
&lt;/h2&gt;

&lt;p&gt;Ready to put these concepts into practice? Dive into the full episode to see the demos in action and explore how GKE can supercharge your agentic workloads.&lt;/p&gt;

&lt;p&gt;Learn how to &lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/tutorials/agentic-adk-vertex?utm_campaign=CDR_0x036db2a4_default&amp;amp;utm_medium=external&amp;amp;utm_source=youtube" rel="noopener noreferrer"&gt;deploy an ADK agent to Google Kubernetes Engine&lt;/a&gt; and how to get your agent to run code safely using the &lt;a href="http://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox" rel="noopener noreferrer"&gt;GKE Agent Sandbox&lt;/a&gt;.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Connect with us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Shir Meir Lador → &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mofi Rahman → &lt;a href="https://www.linkedin.com/in/moficodes" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Brandon Royal → &lt;a href="https://www.linkedin.com/in/brandonroyal/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Agent Factory Recap: Reinforcement Learning and Fine-Tuning on TPUs</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Tue, 31 Mar 2026 18:56:42 +0000</pubDate>
      <link>https://dev.to/googleai/agent-factory-recap-reinforcement-learning-and-fine-tuning-on-tpus-1o6j</link>
      <guid>https://dev.to/googleai/agent-factory-recap-reinforcement-learning-and-fine-tuning-on-tpus-1o6j</guid>
      <description>&lt;p&gt;In our agent factory holiday special, Don McCasland and I were joined by Kyle Meggs, Senior Product Manager on the TPU Training Team at Google, to dive deep into the world of model fine tuning. We focused specifically on reinforcement learning (RL), and how Google's own infrastructure of TPUs are designed to power these massive workloads at scale.&lt;/p&gt;

&lt;p&gt;This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Consider Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=2&amp;amp;t=193s" rel="noopener noreferrer"&gt;3:13&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started with a fundamental question: with foundational models like Gemini becoming so powerful out of the box, and customization through the prompt can often be good enough, when should you consider fine-tuning? &lt;/p&gt;

&lt;p&gt;Fine tuning your own model is relevant when you need high specialization for unique datasets where a generalist model might not excel (such as in the medical domain), or when you have strict privacy restrictions that require hosting your own models trained on your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model Lifecycle: Pre-training and Post-training (SFT and RL)
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=232s" rel="noopener noreferrer"&gt;3:52&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Kyle used a great analogy inspired by Andrej Karpathy to break down the stages of training. He described pre-training as "knowledge acquisition," similar to reading a chemistry textbook to learn how things work. Post-training is further split into Supervised Fine-Tuning (SFT), which is analogous to reading already-solved practice problems within the textbook chapter, and Reinforcement Learning (RL), which is like solving new practice problems without help and then checking your answers in the back of the book to measure yourself against an optimal approach and correct answers. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc192k921af4wed7698x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc192k921af4wed7698x.png" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Reinforcement Learning (RL) is Essential
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=350s" rel="noopener noreferrer"&gt;5:50&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;We explored why RL is currently so important for building modern LLMs. Kyle explained that unlike SFT, which is about imitation, RL is about grading actions to drive "alignment." It’s crucial for teaching a model safety (penalizing what not to do), enabling the model to use tools like search and interact with the physical world through trial and error, and for performing verifiable tasks like math or coding by rewarding the entire chain of thought that leads to a correct answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Industry Pulse: Why 2025 is the year of RL
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=513s" rel="noopener noreferrer"&gt;8:33&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;In this segment, we looked at the rapidly evolving landscape of RL. Kyle noted that it is fair to call 2025 the "year of RL," highlighting the massive increase in investment and launches across the industry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;January:&lt;/strong&gt; DeepSeek-R1 launched, making a huge splash with open-source GRPO.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Summer:&lt;/strong&gt; xAI launched Grok 4, reportedly running a 200k GPU cluster for RL at "pre-training scale."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;October:&lt;/strong&gt; A slew of new tooling launches across Google, Meta, and TML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;November:&lt;/strong&gt; Gemini 3 launched as a premier thinking model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recent:&lt;/strong&gt; Google launched MaxText 2.0 for fine-tuning on TPUs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ud8v71oa92vgbu4iz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ud8v71oa92vgbu4iz5.png" alt="alt text" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hurdles of Implementing RL
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=646s" rel="noopener noreferrer"&gt;10:46&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Following the industry trends, we discussed why RL is so difficult to implement. Kyle explained that RL combines the complexities of both training and inference into a single process. He outlined three primary challenges: managing infrastructure at the right balance and scale to avoid bottlenecks; choosing the right code, models, algorithms (like GRPO vs. DPO), and data; and finally, the difficulty of integrating disparate components for training, inference, orchestration, and weight synchronization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjca0lpcpo23s95mzv876.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjca0lpcpo23s95mzv876.png" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To provide a solution across these dimensions of complexity, Google offers MaxText, a vertically integrated solution to help you perform RL in a highly scalable and performant fashion. MaxText provides highly optimized models, the latest post-training algorithms, high-performance inference via vLLM, and powerful scalability and flexibility via Pathways. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rch212bej2n6eck8lq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rch212bej2n6eck8lq8.png" alt="alt text" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In contrast to DIY approaches where users assemble their own stack of disparate components from many different providers, Google’s approach offers a single integrated stack of co-designed components, from &lt;strong&gt;silicon&lt;/strong&gt; to &lt;strong&gt;software&lt;/strong&gt; to &lt;strong&gt;solutions&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctihvw4xt9q6ajs1dfdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctihvw4xt9q6ajs1dfdp.png" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Factory Floor
&lt;/h2&gt;

&lt;p&gt;The Factory Floor is our segment for getting hands-on. Here, we moved from high-level concepts to practical code with a live demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why TPUs Shine for RL
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=772s" rel="noopener noreferrer"&gt;12:52&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Before diving into the demo, Kyle explained why TPUs are uniquely suited for complex AI workloads like RL. Unlike other hardware, TPUs were designed system-first. A TPU Pod can connect up to 9,216 chips over low-latency interconnects, allowing for massive scale without relying on standard data center networks. This is a huge advantage for overcoming RL bottlenecks like weight synchronization. Furthermore, because they are purpose-built for AI, they offer superior price-performance and thermal efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitkt61wg3qhq2oobmryd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitkt61wg3qhq2oobmryd.png" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo: Reinforcement Learning (GRPO) with TPU
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=953s" rel="noopener noreferrer"&gt;15:53&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Don led a hands-on demonstration showing what RL looks like in action using Google's infrastructure. The demo showcased:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Using &lt;strong&gt;MaxText 2.0&lt;/strong&gt; as an integrated solution for the workload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leveraging models from MaxText and algorithms from Tunix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling inference using vLLM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Utilizing &lt;strong&gt;Pathways&lt;/strong&gt; for orchestration and scaling to run GRPO (Group Relative Policy Optimization).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4tqmo8zv62i6oufqj8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4tqmo8zv62i6oufqj8q.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This holiday special was a great deep dive into the cutting edge of model fine tuning. While foundational models are getting better every day, the future of highly specialized, capable agents relies on mastering post-training techniques like RL, and having the right vertically integrated infrastructure, like TPUs, to run them efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your turn to build
&lt;/h2&gt;

&lt;p&gt;We hope this episode gave you valuable tools and perspectives to think about fine-tuning your own specialized agents. Be sure to check out the resources below to explore MaxText 2.0 and start experimenting with TPUs for your workloads. We'll see you next year for a revamped season of The Agent Factory!&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MaxText Post-Training Docs: &lt;a href="https://maxtext.readthedocs.io/en/latest/tutorials/post_training_index.html" rel="noopener noreferrer"&gt;https://maxtext.readthedocs.io/en/latest/tutorials/post_training_index.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google Cloud TPU (Ironwood) Documentation: &lt;a href="https://docs.cloud.google.com/tpu/docs/tpu7x" rel="noopener noreferrer"&gt;https://docs.cloud.google.com/tpu/docs/tpu7x&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Google Cloud open source code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MaxText - &lt;a href="https://github.com/AI-Hypercomputer/maxtext" rel="noopener noreferrer"&gt;https://github.com/AI-Hypercomputer/maxtext&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GPU recipes - &lt;a href="https://github.com/AI-Hypercomputer/gpu-recipes" rel="noopener noreferrer"&gt;https://github.com/AI-Hypercomputer/gpu-recipes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TPU recipes - &lt;a href="https://github.com/AI-Hypercomputer/tpu-recipes" rel="noopener noreferrer"&gt;https://github.com/AI-Hypercomputer/tpu-recipes&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Andrej Karpathy - Chemistry Analogy: &lt;a href="https://youtu.be/7xTGNNLPyMI?si=Bubrqz_dPpvuqc1M&amp;amp;t=8069" rel="noopener noreferrer"&gt;Deep Dive into LLMs like ChatGPT&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Paper: "Small Language Models are the Future of Agentic AI" (NVIDIA): &lt;a href="https://arxiv.org/abs/2506.02153" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2506.02153&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Fine-tuning blog: &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/a-step-by-step-guide-to-fine-tuning-medgemma-for-breast-tumor-classification?e=48754805" rel="noopener noreferrer"&gt;https://cloud.google.com/blog/topics/developers-practitioners/a-step-by-step-guide-to-fine-tuning-medgemma-for-breast-tumor-classification?e=48754805&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Connect with us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Shir Meir Lador →  &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/shirmeirlador/&lt;/a&gt;, &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don McCasland →  &lt;a href="https://www.linkedin.com/in/donald-mccasland/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/donald-mccasland/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kyle Meggs → &lt;a href="https://www.linkedin.com/in/kyle-meggs/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/kyle-meggs/&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>gemini</category>
    </item>
    <item>
      <title>My First Experience Creating Antigravity Skills</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Fri, 20 Mar 2026 15:23:02 +0000</pubDate>
      <link>https://dev.to/googleai/my-first-experience-creating-antigravity-skills-524b</link>
      <guid>https://dev.to/googleai/my-first-experience-creating-antigravity-skills-524b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cvbil990snohnuztk9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cvbil990snohnuztk9w.png" width="700" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Experimenting with Agent skills for the first time, feeling empowered!&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Last week, I was at an event where we taught developers how to build &lt;a href="https://goo.gle/aaiwcr-1" rel="noopener noreferrer"&gt;MCP servers&lt;/a&gt; and &lt;a href="http://goo.gle/aaiwcr-2" rel="noopener noreferrer"&gt;agents&lt;/a&gt;, and how to &lt;a href="http://goo.gle/aaiwcr-3" rel="noopener noreferrer"&gt;deploy open models&lt;/a&gt; to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b491641592&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud Run&lt;/a&gt;. After the session, one of the developers shared something that really stuck with me: he was already using our content to create specialized &lt;a href="https://antigravity.google/docs/skills" rel="noopener noreferrer"&gt;&lt;strong&gt;Skills&lt;/strong&gt;&lt;/a&gt; to share with his entire team.&lt;/p&gt;

&lt;p&gt;I got inspired and decided it was time to dive into &lt;a href="https://antigravity.google/docs/skills" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;. During my last project, the dev-signal agent, I learned a lot about how to bring agents and AI applications to production in a robust and scalable manner. I thought, &lt;em&gt;this is a great opportunity to give my favorite coding agent, Google’s &lt;a href="https://www.antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt; (an “agent-first” IDE), those skills so that going forward, it will just do it for me!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through how I built the 13 production skills in this &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal/.agent/skills" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and the patterns behind them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Agent Skills?
&lt;/h2&gt;

&lt;p&gt;As &lt;a href="https://www.linkedin.com/in/iromin/?originalSubdomain=in" rel="noopener noreferrer"&gt;Romin Irani&lt;/a&gt; explains in &lt;a href="https://medium.com/google-cloud/tutorial-getting-started-with-antigravity-skills-864041811e0d" rel="noopener noreferrer"&gt;“Getting Started with Google Antigravity Skills”&lt;/a&gt;, skills represent a shift from monolithic context loading to &lt;strong&gt;Progressive Disclosure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Agents get “overwhelmed” when they are given too many tools all at once (a phenomenon known as “&lt;a href="https://www.linkedin.com/posts/smithakolan_your-ai-agent-is-not-bad-at-reasoning-activity-7422342915089178624-awR3?rcm=ACoAAAYeeDsBfJzKJQaDuSjRnUBmKV20OJV2olc" rel="noopener noreferrer"&gt;Tool Bloat&lt;/a&gt;”). To solve for that, Skills allow the agent to “load” specialist knowledge only when needed. When you ask an agent to “evaluate a shadow revision,” it will figure out that it needs to leverage the &lt;strong&gt;Shadow Deployer&lt;/strong&gt; skill as context for this operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workspace vs. Global Scope
&lt;/h2&gt;

&lt;p&gt;In Antigravity, you can manage these skills in two distinct ways depending on how you want to use them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workspace Scope:&lt;/strong&gt; Located in &lt;em&gt;.agent/skills/&lt;/em&gt; within your project root. These are specific to your project and can be committed to GitHub so your entire team can benefit from the same production patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global Scope:&lt;/strong&gt; Located in &lt;em&gt;~/.gemini/antigravity/skills/&lt;/em&gt;. These are your personal utilities that stay with you across every project you work on.&lt;/li&gt;
&lt;/ul&gt;
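
&lt;p&gt;Either way, a skill is just a folder containing a &lt;em&gt;SKILL.md&lt;/em&gt; file. Here is a hypothetical minimal sketch, modeled on one of the skills in the inventory later in this post (the frontmatter fields follow the common name/description convention; check the Antigravity docs for the exact schema):&lt;/p&gt;

```markdown
---
name: gcp-production-secret-handler
description: Fetch secrets in memory from Google Cloud Secret Manager instead of env files.
---

# GCP Production Secret Handler

When the user asks to handle credentials for a Cloud Run service:

1. Never write secrets to .env files or bake them into the container image.
2. Fetch them at startup with the Secret Manager client and keep them in memory.
3. Grant the service account only the Secret Manager accessor role it needs.
```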

&lt;h2&gt;
  
  
  How I built the skills
&lt;/h2&gt;

&lt;p&gt;Following the principles in &lt;a href="https://www.linkedin.com/in/petruzalek/" rel="noopener noreferrer"&gt;Daniela Petruzalek&lt;/a&gt;’s &lt;a href="https://medium.com/google-cloud/building-agent-skills-with-skill-creator-855f18e785cf" rel="noopener noreferrer"&gt;“Building Agent Skills with skill-creator”,&lt;/a&gt; I took a “methodology-first” approach. I used the existing dev-signal blog series I’ve been working on and the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;codebase&lt;/a&gt; itself as core context, asking Antigravity to identify and codify the unique skills needed to &lt;strong&gt;build a production agent on Google Cloud.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For some of the more specialized areas, I provided additional context with patterns I’d like to follow, such as the agent evaluation &lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-roadshow/2-evaluating-multi-agent-systems/evaluating-multi-agent-systems#0" rel="noopener noreferrer"&gt;codelab&lt;/a&gt; and &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/from-vibe-checks-to-continuous-evaluation-engineering-reliable-ai-agents?utm_campaign=CDR_0x91b1edb5_default_b491641592&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;blog&lt;/a&gt; and the agent security &lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-roadshow/3-securing-a-multi-agent-system/securing-a-multi-agent-system#0?utm_campaign=CDR_0x91b1edb5_default_b491641592&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;codelab&lt;/a&gt;, both written by my awesome team.&lt;/p&gt;

&lt;p&gt;These 13 skills provide Antigravity (or any developer using them) with the crucial toolkit of a Google Cloud Production Engineer. I’m currently finalizing a detailed, step-by-step walkthrough of the dev-signal agent, which will be published on the &lt;a href="https://cloud.google.com/blog" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud Blog&lt;/strong&gt;&lt;/a&gt; very soon! (Follow me for future updates.)&lt;/p&gt;

&lt;p&gt;In the meantime, you don’t have to wait — the full &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal/.agent/skills" rel="noopener noreferrer"&gt;skills&lt;/a&gt; are available for you to explore and leverage in your own projects today.&lt;/p&gt;

&lt;p&gt;Here is the full inventory of the skills:&lt;/p&gt;

&lt;h2&gt;
  
  
  🏗️ Production Agent
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;adk-memory-bank-initializer:&lt;/strong&gt; Long-term state logic with Vertex AI Memory Bank.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agent-containerizer:&lt;/strong&gt; Mixed-runtime Dockerfiles (Python + Node.js).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cloud-run-agent-architect:&lt;/strong&gt; Least-privilege Terraform for Cloud Run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-production-secret-handler:&lt;/strong&gt; In-memory secret fetching pattern (Secret Manager).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mcp-connector-generator:&lt;/strong&gt; Standardized MCP connection logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📊 Evaluation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-eval-engine-runner:&lt;/strong&gt; Parallel inference and reasoning trace capture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-eval-metric-configurator:&lt;/strong&gt; Setup for Grounding and Tool Use rubrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-golden-dataset-builder:&lt;/strong&gt; Tools for building datasets with reference trajectories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-shadow-deployer:&lt;/strong&gt; “Dark Canary” deployment scripts with revision tagging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-tool-trajectory-evaluator:&lt;/strong&gt; Custom Python metrics for Precision and Recall.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛡️ Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-model-armor-shield:&lt;/strong&gt; Intelligent firewall (Prompt Injection, RAI, Malicious URL filters).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-safety-gatekeeper:&lt;/strong&gt; Python integration pattern (safety_util.py) for sanitizing user inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gcp-agent-sdp-template-factory:&lt;/strong&gt; Terraform for Sensitive Data Protection (PII/Secret redaction).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By codifying these patterns into production skills, Antigravity can now leverage them automatically in my day-to-day development. I hope you find these as helpful as I do!&lt;/p&gt;

&lt;h2&gt;
  
  
  Pro tip: self-improving skills!
&lt;/h2&gt;

&lt;p&gt;Because these skills were AI-generated, they might not work perfectly for your specific environment on the first try. But that’s actually the best part of working with an agentic IDE. If a skill doesn’t work well for you, don’t just manually fix the code; let the coding agent figure it out. Once it finds the solution, you can ask it to update the corresponding SKILL.md with the learned workflow. This captures the correction for the future, ensuring the agent doesn’t repeat the mistake while saving you tokens and time on the next run. Think of these as living documents that actively improve as you build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to get started?&lt;/strong&gt; Clone the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and add these skills to your Workspace or Global Scope to start building your own production-ready agents. Learn more about &lt;a href="https://antigravity.google/docs/skills" rel="noopener noreferrer"&gt;Agent skills.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow me on &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt; for updates on my next blogs and videos.&lt;/p&gt;

</description>
      <category>antigravity</category>
      <category>ai</category>
      <category>googlecloud</category>
      <category>agents</category>
    </item>
    <item>
      <title>How I Turned an Ugly Spreadsheet into an AI Assisted App with Antigravity</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Wed, 18 Feb 2026 17:39:12 +0000</pubDate>
      <link>https://dev.to/googleai/how-i-turned-an-ugly-spreadsheet-into-an-ai-assisted-app-with-antigravity-3j52</link>
      <guid>https://dev.to/googleai/how-i-turned-an-ugly-spreadsheet-into-an-ai-assisted-app-with-antigravity-3j52</guid>
      <description>&lt;p&gt;&lt;strong&gt;I have a confession to make.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Up until now, I wasn’t that much into “vibe coding.” I used AI all the time for Python coding, but I never really built a whole app from scratch in a language I knew nothing about.&lt;/p&gt;

&lt;p&gt;That changed today. I encountered a really annoying problem: I had to review a massive number of talk submissions for a conference, all living in one giant spreadsheet. Staring at those tiny cells was making my eyes hurt.&lt;/p&gt;

&lt;p&gt;My initial thought was, “Hey, let’s create a really sharp UI for the submission review.” But then I thought, why stop there? Why not let AI provide me valuable inputs from social media to help me with the review itself?&lt;/p&gt;

&lt;p&gt;So, I decided to build &lt;strong&gt;TalkScout&lt;/strong&gt;. And since I wanted to test drive &lt;a href="https://antigravity.google/docs/home" rel="noopener noreferrer"&gt;Google Antigravity&lt;/a&gt; (Google’s new AI-powered coding agent), I figured this was the perfect opportunity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftvcnagk5wbmvmw2dxxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftvcnagk5wbmvmw2dxxt.png" alt="talkscout dashboard" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Talkscout Dashboard (synthetic data)&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Here is how I went from a painful CSV to a fully deployed &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; app, without writing a single line of React code myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The “Meta-Prompt” (Asking Gemini to Talk to Antigravity)
&lt;/h2&gt;

&lt;p&gt;I didn’t start by coding; I started by chatting. I used &lt;strong&gt;meta-prompting&lt;/strong&gt; to get started.&lt;/p&gt;

&lt;p&gt;So, what is meta-prompting, you may ask? It’s actually when you go to Gemini 3 and ask it to write the prompt for the coding agent.&lt;/p&gt;

&lt;p&gt;I explained my problem to &lt;strong&gt;Gemini 3&lt;/strong&gt; in simple words. Gemini 3 acted as my architect. It turned my “brain dump” requirements into a technical spec, defining the component structure and data model. I didn’t have to guess the right words; I just pasted that polished spec into Antigravity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Ditching the Spreadsheet for a Dashboard
&lt;/h2&gt;

&lt;p&gt;With that prompt, Antigravity built the app of my dreams. It allowed me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload the CSV with all the conference talks.&lt;/li&gt;
&lt;li&gt;Get a dashboard showing the status of each talk.&lt;/li&gt;
&lt;li&gt;See a beautiful, high-contrast UI to review abstracts and demo plans without squinting at cells.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuv05d3jhgbptocmquud4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuv05d3jhgbptocmquud4.png" alt="TalkScout submission review page with high contrast UI" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;TalkScout submission review page with high contrast UI&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;&lt;strong&gt;The “Vibe” Fix:&lt;/strong&gt; It wasn’t all smooth sailing — I actually hit a nasty React hydration error. This can take hours to debug, especially if you’re not a frontend developer… But I simply provided the error message to Antigravity and the coding agent pinpointed the mismatch in the DOM and fixed it in minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Integrating Grounded Intelligence
&lt;/h2&gt;

&lt;p&gt;I didn’t just want a UI; I wanted to overcome my own bias. How do I know if a niche topic is actually hot?&lt;/p&gt;

&lt;p&gt;I added a button to get an &lt;strong&gt;AI Assessment&lt;/strong&gt;. But I didn’t want hallucinations. I used &lt;strong&gt;Google Search Grounding&lt;/strong&gt; so the AI could search through Reddit, X (Twitter), and LinkedIn for real-world developer signals. That gave me inputs grounded in current developer audience mindshare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftck7aytx9ecgnlrifq08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftck7aytx9ecgnlrifq08.png" alt="TalkScout submission review page with AI social media analysis" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;TalkScout submission review page with AI social media analysis&lt;/small&gt;&lt;/center&gt;

&lt;h2&gt;
  
  
  Step 4: Calibrating the “Strict” Reviewer
&lt;/h2&gt;

&lt;p&gt;Initially, the AI was way too nice. It was giving high scores to anything with trendy keywords.&lt;/p&gt;

&lt;p&gt;I used what’s called &lt;strong&gt;few-shot prompting&lt;/strong&gt; to calibrate it. I gave examples of my scores vs. its scores and introduced what I call the &lt;strong&gt;“Marketing Fluff Penalty”&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a submission reads like a documentation/marketing page? Points docked.&lt;/li&gt;
&lt;li&gt;If the submission was way too short? We capped the score at a hard 2.&lt;/li&gt;
&lt;li&gt;If it includes war stories and actual learnings? Rating boosted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a few examples, it became more calibrated to my taste.&lt;/p&gt;
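
&lt;p&gt;If you prefer code to prose, the rubric boils down to something like this (a hypothetical post-processing sketch with placeholder keyword lists; in TalkScout the calibration actually lives in the few-shot prompt itself):&lt;/p&gt;

```python
def calibrated_score(raw_score: int, text: str) -> int:
    """Apply the 'Marketing Fluff Penalty' rubric on top of a raw 1-5 score.

    The keyword lists below are illustrative placeholders, not the real prompt.
    """
    score = raw_score
    lowered = text.lower()
    # Way too short? Cap the score at a hard 2.
    if 50 > len(text.split()):
        score = min(score, 2)
    # Reads like a documentation/marketing page? Points docked.
    if any(w in lowered for w in ("revolutionary", "game-changing", "seamless")):
        score -= 1
    # War stories and actual learnings? Rating boosted.
    if any(w in lowered for w in ("we learned", "postmortem", "incident")):
        score += 1
    return max(1, min(5, score))

print(calibrated_score(5, "Our revolutionary, seamless, game-changing platform"))  # 1
```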

&lt;h2&gt;
  
  
  Step 5: The Pivot to Batch Mode
&lt;/h2&gt;

&lt;p&gt;I realized it was taking me too long to ask the AI to evaluate each talk individually while I reviewed it.&lt;/p&gt;

&lt;p&gt;So, I asked Antigravity to refactor the backend for &lt;strong&gt;Batch Mode&lt;/strong&gt;. Now, TalkScout processes the entire submission pool in the background. By the time I grab a coffee, the “AI Draft” column is full of insights, allowing me to focus only on the final decisions.&lt;/p&gt;
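
&lt;p&gt;The fan-out itself is a standard pattern. A minimal sketch of what the refactor amounts to (the &lt;em&gt;assess&lt;/em&gt; stub here is hypothetical; the real version calls the Gemini API per submission):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def assess(submission: dict) -> dict:
    """Stand-in for the real per-talk Gemini call (hypothetical stub)."""
    return {**submission, "ai_draft": f"Assessment for {submission['title']}"}

def batch_assess(submissions: list, workers: int = 8) -> list:
    # Fan the whole submission pool out to a worker pool so the
    # "AI Draft" column fills in while the reviewer does other things.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(assess, submissions))  # map preserves input order

drafts = batch_assess([{"title": "Talk A"}, {"title": "Talk B"}])
print(drafts[0]["ai_draft"])  # Assessment for Talk A
```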

&lt;h2&gt;
  
  
  Step 6: Sharing the Goodness (Deploy to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;)
&lt;/h2&gt;

&lt;p&gt;TalkScout was working great for me, but I thought, “It would be great to share this with the other reviewers.”&lt;/p&gt;

&lt;p&gt;This is where Antigravity really showed off. I simply asked it to deploy the app. It automatically recognized my Google Cloud Project ID, handled the containerization, generated the exact deployment commands, and deployed it to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One simple ask, and minutes later, I had a URL to share with the team.&lt;/p&gt;

&lt;h2&gt;
  
  
  It Was Pretty Fun!
&lt;/h2&gt;

&lt;p&gt;It was pretty fun to actually solve a real problem I had using Antigravity and vibe coding. I built a tool that handles ingestion, offers a distraction-free rating interface, and surfaces valuable inputs for my reviews.&lt;/p&gt;

&lt;p&gt;I would love to hear from you all - have you recently solved a problem using vibe coding?&lt;/p&gt;

&lt;p&gt;If you haven’t already - try playing around with &lt;a href="https://antigravity.google/docs/home" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt; and easily deploy your apps to &lt;a href="https://docs.cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b473111509&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>antigravity</category>
      <category>ai</category>
      <category>gemini</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>Decoding high-bandwidth memory: A practical guide to GPU memory for fine-tuning AI models</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 15 Jan 2026 15:27:00 +0000</pubDate>
      <link>https://dev.to/googleai/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models-56af</link>
      <guid>https://dev.to/googleai/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models-56af</guid>
      <description>&lt;p&gt;We've all been there. You've meticulously prepared your dataset and written your training script. You hit &lt;strong&gt;run&lt;/strong&gt;, and your excitement builds, only to be crushed by the infamous error: CUDA out of memory.&lt;/p&gt;

&lt;p&gt;This is one of the most common roadblocks in AI development. Your GPU's &lt;a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory" rel="noopener noreferrer"&gt;High Bandwidth Memory (HBM)&lt;/a&gt; is the high-speed memory that holds everything needed for computation, and running out of it is a hard stop. But how do you know how much you need?&lt;/p&gt;

&lt;p&gt;To build a clear foundation, we'll start by breaking down the HBM consumers on a single GPU and present key strategies to reduce that consumption. Later, we'll explore advanced multi-GPU strategies like data and &lt;a href="https://huggingface.co/docs/transformers/v4.13.0/en/parallelism" rel="noopener noreferrer"&gt;model parallelism&lt;/a&gt; that can help relieve memory pressure and scale your training in the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding HBM: What's using all the memory?
&lt;/h2&gt;

&lt;p&gt;When you fine-tune a model, your HBM is primarily consumed by three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.webopedia.com/technology/llm-tokens-weights-parameters/#:~:text=in%20various%20contexts.-,What%20are%20LLM%20Weights?,or%20generate%20coherent%2C%20meaningful%20responses." rel="noopener noreferrer"&gt;Model Weights&lt;/a&gt;:&lt;/strong&gt; This is the most straightforward. It's the storage space required for the model's parameters—the "brain" that it uses to make predictions. A 7-billion parameter model loaded in 16-bit precision will take up roughly 14 GB before you even process a single piece of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://eureka.patsnap.com/article/what-is-the-optimizer-state-in-deep-learning-training" rel="noopener noreferrer"&gt;Optimizer States&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Gradient_descent" rel="noopener noreferrer"&gt;Gradients&lt;/a&gt;:&lt;/strong&gt; This is the overhead that's required for learning. To update the model's weights, the training process needs to calculate gradients (the direction of learning) and the &lt;a href="https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adam_Deep_Learning_Optimizer" rel="noopener noreferrer"&gt;optimizer&lt;/a&gt; (like the popular &lt;a href="https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html" rel="noopener noreferrer"&gt;AdamW&lt;/a&gt;) needs to store its own data to guide the training. In full fine-tuning, this can be the largest consumer of HBM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Activation_function" rel="noopener noreferrer"&gt;Activations&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Online_machine_learning#Batch_learning" rel="noopener noreferrer"&gt;Batch Data&lt;/a&gt;:&lt;/strong&gt; This is the most dynamic part. When your data (images, text, etc.) flows through the model's layers, the intermediate calculations, or activations, are stored in HBM. The memory needed here is directly proportional to your batch size. A larger batch size means more activations are stored simultaneously, which leads to faster training but much higher memory usage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These calculations are theoretical minimums. Real-world frameworks add up to 30% overhead due to &lt;a href="https://arxiv.org/abs/1910.02054" rel="noopener noreferrer"&gt;temporary buffers, kernel launches, and memory fragmentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although it's impossible to get a perfect number without experimentation, you can estimate your HBM needs with this general formula:&lt;br&gt;
&lt;em&gt;&lt;center&gt;Total HBM ≈ (Model Size) + (Optimizer States) + (Gradients) + (Activations)&lt;/center&gt;&lt;/em&gt;&lt;br&gt;
 &lt;br&gt;
&lt;strong&gt;Further reading:&lt;/strong&gt; See this excellent JAX e-book that covers &lt;a href="https://jax-ml.github.io/scaling-book/gpus/" rel="noopener noreferrer"&gt;these topics&lt;/a&gt; in great detail and even has some &lt;a href="https://jax-ml.github.io/scaling-book/gpus/#quiz-5-llm-rooflines" rel="noopener noreferrer"&gt;"try it out yourself" test questions&lt;/a&gt;.&lt;/p&gt;
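
&lt;p&gt;As a quick sanity check, the weights term of the formula translates directly to code (a minimal sketch, treating 1 GB as 10&lt;sup&gt;9&lt;/sup&gt; bytes and ignoring the framework overhead noted above):&lt;/p&gt;

```python
def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

# A 7-billion-parameter model in 16-bit (2-byte) precision:
print(weights_gb(7e9))  # 14.0
```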
&lt;h2&gt;
  
  
  Example: Why full fine-tuning is so demanding
&lt;/h2&gt;

&lt;p&gt;To see why running out of memory is such a common problem, let's walk through a real-world example that I recently worked on: fine-tuning the &lt;a href="https://deepmind.google/models/gemma/medgemma/" rel="noopener noreferrer"&gt;medgemma-4b-it model&lt;/a&gt;, which has 4 billion parameters. Our &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/a-step-by-step-guide-to-fine-tuning-medgemma-for-breast-tumor-classification" rel="noopener noreferrer"&gt;script&lt;/a&gt; loads it in bfloat16 precision (2 bytes per parameter).&lt;/p&gt;

&lt;p&gt;First, let's calculate the static HBM footprint. This is the memory that's required just to load the model and prepare it for training, before you've even processed a single piece of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Model Size:&lt;/strong&gt; The memory that's needed to simply hold the model on the GPU.&lt;/p&gt;

&lt;center&gt;4 billion parameters × 2 bytes/parameter = 8 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Gradients and Optimizer States:&lt;/strong&gt; The overhead for training every parameter with the AdamW optimizer.&lt;/p&gt;

&lt;center&gt;Gradients: 4 billion parameters × 2 bytes/parameter = 8 GB&lt;/center&gt;

&lt;center&gt;Optimizer States (AdamW): 2 × 4 billion parameters × 2 bytes/parameter = 16 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; While AdamW is a popular optimizer, other optimizers, such as Adafactor and Lion, have different memory footprints.&lt;/p&gt;

&lt;p&gt;Adding these together gives us our baseline HBM cost for a full fine-tuning attempt:&lt;/p&gt;

&lt;center&gt;8 GB (Model) + 8 GB (Gradients) + 16 GB (Optimizer) = 32 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;This 32 GB is the baseline just to start the training process. On top of this, the GPU needs &lt;strong&gt;additional memory for activations&lt;/strong&gt;, which is a &lt;em&gt;dynamic&lt;/em&gt; cost that grows with your batch size and input data size. This is why full fine-tuning of large models is so demanding and often reserved for the most powerful hardware.&lt;/p&gt;
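
&lt;p&gt;The walkthrough above is easy to reproduce in code, so you can plug in your own model size (same simplified 1 GB = 10&lt;sup&gt;9&lt;/sup&gt; bytes convention; activations come on top of this baseline):&lt;/p&gt;

```python
PARAMS = 4e9         # medgemma-4b-it
BYTES_PER_PARAM = 2  # bfloat16

model_gb = PARAMS * BYTES_PER_PARAM / 1e9      # 8.0 GB of weights
grads_gb = PARAMS * BYTES_PER_PARAM / 1e9      # 8.0 GB of gradients
optim_gb = 2 * PARAMS * BYTES_PER_PARAM / 1e9  # 16.0 GB (AdamW keeps two states per parameter)

static_gb = model_gb + grads_gb + optim_gb
print(static_gb)  # 32.0
```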
&lt;h2&gt;
  
  
  Key strategies to reduce HBM consumption
&lt;/h2&gt;

&lt;p&gt;The HBM requirement for a full fine-tune can seem impossibly high. But several powerful techniques can reduce memory consumption, making it feasible to train large models on consumer-grade or entry-level professional GPUs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Parameter-Efficient Fine-Tuning (PEFT) with LoRA
&lt;/h3&gt;

&lt;p&gt;Instead of training all the billions of parameters in a model, &lt;a href="https://huggingface.co/docs/peft/en/index" rel="noopener noreferrer"&gt;Parameter-Efficient Fine-Tuning (PEFT)&lt;/a&gt; methods focus on training only a small subset of parameters. The most popular of these is &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA (Low-Rank Adaptation)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/lora-qlora?utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;LoRA&lt;/a&gt; works by freezing &lt;strong&gt;the original model's weights and injecting a tiny number of new, trainable &lt;em&gt;adapter&lt;/em&gt; layers&lt;/strong&gt; into the model architecture. This means the memory-hungry gradients and optimizer states are only needed for these few million new parameters, not the full 4 billion.&lt;/p&gt;
&lt;h4&gt;
  
  
  The math behind LoRA's memory savings
&lt;/h4&gt;

&lt;p&gt;LoRA doesn't remove the base model from your GPU. The full 8 GB of the original model's weights are still loaded and taking up HBM. They're just frozen, which means that the GPU isn't training them. All of the memory savings come from the fact that you no longer need to store the huge gradients and optimizer states for that massive, frozen part of the model.&lt;/p&gt;

&lt;p&gt;Let's recalculate the static HBM footprint with LoRA, assuming it adds 20 million trainable parameters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Model Size (unchanged):&lt;/strong&gt; The base model is still loaded.&lt;/p&gt;

&lt;center&gt;4 billion parameters × 2 bytes/parameter = 8 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. LoRA Gradients &amp;amp; Optimizer States:&lt;/strong&gt; We now only need overhead for the tiny set of new parameters.&lt;/p&gt;

&lt;center&gt;Gradients: 20 million parameters × 2 bytes/parameter = 40 MB&lt;/center&gt;

&lt;center&gt;
Optimizer States: 2 × 20 million parameters × 2 bytes/parameter = 80 MB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The new static HBM footprint is now:&lt;/p&gt;

&lt;center&gt;8 GB (Model) + 40 MB (Gradients) + 80 MB (Optimizer) ≈ 8.12 GB&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The training overhead has shrunk from 24 GB to just 120 MB. Your new baseline memory requirement is now just over 8 GB. This lower baseline memory requirement leaves much more room for the dynamic memory that's needed for activations, which lets you use a reasonable batch size on a common 16 GB or 24 GB GPU without running out of memory.&lt;/p&gt;
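
&lt;p&gt;If it helps, the whole calculation fits in a few lines of Python. This is just a sketch of the arithmetic above (the function name and decimal-GB convention are mine), not a profiler:&lt;/p&gt;

```python
def static_hbm_gb(params, trainable_params, bytes_per_param=2, optimizer_states=2):
    """Static HBM footprint in GB: weights + gradients + optimizer states.

    Assumes bf16 weights and gradients (2 bytes each) and an optimizer that
    keeps `optimizer_states` extra bf16 copies per trainable parameter.
    Activations are excluded: they scale with batch size and sequence length.
    """
    weights = params * bytes_per_param
    gradients = trainable_params * bytes_per_param
    optimizer = optimizer_states * trainable_params * bytes_per_param
    return (weights + gradients + optimizer) / 1e9  # decimal GB, as above

print(static_hbm_gb(4e9, trainable_params=4e9))    # full fine-tune: 32.0 GB
print(static_hbm_gb(4e9, trainable_params=20e6))   # LoRA adapters: 8.12 GB
```

&lt;p&gt;Plug in your own parameter counts to estimate a static baseline for other models and adapter sizes.&lt;/p&gt;
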
&lt;h3&gt;
  
  
  Model quantization
&lt;/h3&gt;

&lt;p&gt;Besides training fewer parameters, we can also shrink the ones that we have by using &lt;a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization" rel="noopener noreferrer"&gt;quantization&lt;/a&gt;, which involves reducing the &lt;a href="https://arxiv.org/html/2410.13857v1" rel="noopener noreferrer"&gt;numerical precision&lt;/a&gt; of the model's weights. The standard precision for modern training is &lt;a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format" rel="noopener noreferrer"&gt;bfloat16&lt;/a&gt; because it offers the dynamic range of float32 with half the memory footprint. But we can reduce HBM usage further by converting weights to lower-precision integer formats like int8 or int4.&lt;/p&gt;

&lt;p&gt;Using lower-precision integer formats has a significant impact on HBM when compared to the standard bfloat16 baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;bfloat16 (standard):&lt;/strong&gt; The baseline size (e.g., a 7B model requires &lt;strong&gt;~14 GB&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8-bit precision:&lt;/strong&gt; Halves the model size (e.g., 14 GB becomes &lt;strong&gt;~7 GB&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4-bit precision:&lt;/strong&gt; Reduces the model size by a factor of 4 (e.g., 14 GB becomes &lt;strong&gt;~3.5 GB&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reduction in size lets you fit much larger models into memory with minimal degradation in performance.&lt;/p&gt;
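
&lt;p&gt;Those sizes fall straight out of bits-per-parameter arithmetic. A quick sketch (the helper name is mine, using decimal GB as in the rest of this post):&lt;/p&gt;

```python
def weight_storage_gb(params, bits):
    """Approximate weight storage for `params` parameters at `bits` per weight."""
    return params * bits / 8 / 1e9  # 8 bits per byte, decimal GB

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_storage_gb(7e9, bits):.1f} GB")
# 7B model at 16-bit: 14.0 GB
# 7B model at 8-bit: 7.0 GB
# 7B model at 4-bit: 3.5 GB
```
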


&lt;div class="crayons-card c-embed"&gt;

  

&lt;p&gt;&lt;strong&gt;A word of warning from experience:&lt;/strong&gt;&lt;br&gt;
When I started experimenting in this area, my first attempt to load the model using the common float16 data type failed spectacularly: a quick check revealed that every internal value had collapsed into NaN (Not a Number).&lt;/p&gt;

&lt;p&gt;The culprit was a classic &lt;a href="https://en.wikipedia.org/wiki/Integer_overflow" rel="noopener noreferrer"&gt;numerical overflow&lt;/a&gt;. The float16 data type has a tiny numerical range: it can't represent any number larger than 65,504. During training, intermediate values can easily exceed this limit, causing an overflow that produces infinities, which quickly cascade into NaN. The fix was a simple one-line change to bfloat16, whose much larger numerical range prevents these overflows and keeps training stable. For fine-tuning large models, always prefer bfloat16 for stability.&lt;/p&gt;
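
&lt;p&gt;You can derive both limits from first principles: the largest finite value of a binary floating-point format with m mantissa bits and maximum exponent emax is (2 - 2^-m) * 2^emax. A quick pure-Python check (the helper name is mine) recovers both numbers:&lt;/p&gt;

```python
def max_finite(mantissa_bits, max_exponent):
    """Largest finite value of a float format: (2 - 2**-m) * 2**emax."""
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent

fp16_max = max_finite(10, 15)    # float16: 10 mantissa bits, emax = 15
bf16_max = max_finite(7, 127)    # bfloat16: 7 mantissa bits, emax = 127

print(fp16_max)  # 65504.0 -- anything bigger overflows to inf, then NaN
print(bf16_max)  # ~3.39e38 -- essentially the same range as float32
```
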


&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2305.14314" rel="noopener noreferrer"&gt;Combining LoRA and Quantization:&lt;/a&gt;&lt;/strong&gt; These techniques work best together. Quantized LoRA (QLoRA) is a method that stores the massive base model in a highly efficient 4-bit format (specifically NF4 or NormalFloat 4), while adding small, trainable LoRA adapters in bfloat16. During the training process, the 4-bit weights are dequantized to bfloat16 for computation. Dequantizing in process lets you fine-tune very large models on a single GPU with the memory savings of 4-bit storage and the mathematical stability of 16-bit training.&lt;/p&gt;

&lt;h3&gt;
  
  
  FlashAttention: An algorithmic speed boost
&lt;/h3&gt;

&lt;p&gt;Finally, &lt;a href="https://arxiv.org/abs/2205.14135" rel="noopener noreferrer"&gt;FlashAttention&lt;/a&gt; is a foundational algorithmic optimization that significantly reduces HBM usage and speeds up training on both single and multi-GPU setups. The attention mechanism in transformers is a primary memory bottleneck because it requires storing a large, intermediate &lt;a href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29" rel="noopener noreferrer"&gt;attention matrix&lt;/a&gt;. FlashAttention cleverly reorders the computation to avoid storing this full matrix in memory, leading to substantial memory savings and faster execution.&lt;/p&gt;

&lt;p&gt;Best of all, enabling FlashAttention is often as simple as a one-line change. In the MedGemma fine-tuning script, this was done by setting the value &lt;code&gt;attn_implementation="sdpa"&lt;/code&gt;, which can automatically use more efficient backends like FlashAttention if the hardware supports it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling beyond a single GPU: Advanced strategies
&lt;/h2&gt;

&lt;p&gt;Techniques like LoRA and quantization are useful for lowering HBM needs on a single GPU. But to train truly massive models or to really speed up the process, you'll eventually need to scale out to multiple GPUs. Here are some of the key strategies that can be used to distribute the load and overcome memory limitations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data parallelism
&lt;/h3&gt;

&lt;p&gt;Data parallelism is the most common and intuitive approach to scaling. In a Distributed Data Parallel (DDP) setup, the entire model is replicated on each GPU. The key is that the global batch of training data is split, with each GPU processing its own mini-batch concurrently. After each forward and backward pass, the gradients from each GPU are averaged together so that all of the model replicas learn from the entire dataset and stay in sync. This method is excellent for &lt;strong&gt;speeding up training&lt;/strong&gt;, but it &lt;strong&gt;doesn't reduce the HBM&lt;/strong&gt; required to hold the model itself, because every GPU needs a full copy.&lt;/p&gt;
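
&lt;p&gt;The gradient-averaging step is just arithmetic, and a toy example shows why it is equivalent to computing the gradient on the full global batch (all names here are hypothetical, and the plain average relies on equal-sized shards):&lt;/p&gt;

```python
# Toy data-parallel step for a 1-parameter model y = w * x with MSE loss.
# Each "GPU" computes the gradient on its own mini-batch; averaging those
# gradients reproduces the gradient of the full global batch.
def grad(w, batch):
    # d/dw of mean((w*x - y)**2) is mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

full_grad = grad(w, data)                        # single-GPU reference
shards = [data[:2], data[2:]]                    # split batch across 2 "GPUs"
ddp_grad = sum(grad(w, s) for s in shards) / len(shards)
print(full_grad, ddp_grad)                       # the two values match
```
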

&lt;h3&gt;
  
  
  Model parallelism
&lt;/h3&gt;

&lt;p&gt;When a model is too large to fit into the memory of a single GPU, you must use &lt;a href="https://en.wikipedia.org/wiki/Data_parallelism#Data_parallelism_vs._model_parallelism" rel="noopener noreferrer"&gt;model parallelism&lt;/a&gt;. Instead of replicating the model, this strategy &lt;strong&gt;splits the model&lt;/strong&gt; across multiple GPUs. There are two primary ways to do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism" rel="noopener noreferrer"&gt;Tensor parallelism&lt;/a&gt;:&lt;/strong&gt; This method splits a single large operation (like a massive weight matrix in a transformer layer) across several GPUs. Each GPU computes its part of the operation, and the results are combined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.pytorch.org/docs/stable/distributed.pipelining.html" rel="noopener noreferrer"&gt;Pipeline parallelism&lt;/a&gt;:&lt;/strong&gt; This technique places different layers of the model onto different GPUs in a sequence. The data flows through the first set of layers on GPU 1, then the output is passed to GPU 2 for the next set of layers, and so on, like an assembly line.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These strategies are more complex to implement than data parallelism, but they're essential for models that are simply too big for one device.&lt;/p&gt;
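
&lt;p&gt;The assembly-line idea behind pipeline parallelism can be sketched in a few lines of plain Python, with a list of functions standing in for layers assigned to different GPUs (everything here is a toy stand-in, not a real distributed runtime):&lt;/p&gt;

```python
# Toy pipeline parallelism: a hypothetical 4-"layer" model is partitioned
# into two stages, each of which would live on its own GPU. Data flows
# through stage 0, then its output is handed to stage 1, like an assembly line.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def run(stage, x):
    for layer in stage:
        x = layer(x)
    return x

stage_gpu0, stage_gpu1 = layers[:2], layers[2:]     # split the model in half

x = 5
full_model_out = run(layers, x)                     # single-device reference
pipeline_out = run(stage_gpu1, run(stage_gpu0, x))  # GPU0 -> GPU1
print(full_model_out, pipeline_out)                 # identical results
```
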

&lt;h3&gt;
  
  
  Fully Sharded Data Parallelism (FSDP)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html" rel="noopener noreferrer"&gt;FSDP&lt;/a&gt; is a powerful and efficient hybrid strategy that combines the ideas of &lt;strong&gt;data parallelism&lt;/strong&gt; and &lt;strong&gt;model parallelism&lt;/strong&gt;. Unlike standard data parallelism where each GPU holds a full copy of the model, optimizer states, and gradients, FSDP shards (or splits) all of these components across the GPUs. Each GPU only materializes the full parameters for the &lt;strong&gt;specific layer&lt;/strong&gt; that it's computing at that moment, &lt;strong&gt;dramatically reducing the peak HBM&lt;/strong&gt; usage per device. FSDP makes it possible to train enormous models on a cluster of smaller GPUs.&lt;/p&gt;

&lt;p&gt;By combining these hardware and software strategies, you can &lt;strong&gt;scale your fine-tuning jobs&lt;/strong&gt; from a single GPU to a &lt;strong&gt;powerful, distributed cluster&lt;/strong&gt; capable of handling even the most demanding AI models.&lt;/p&gt;

&lt;h2&gt;
  
  
  HBM sizing guide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;HBM&lt;/th&gt;
&lt;th&gt;Use case and explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;Sufficient for basic inference or fine-tuning with techniques like LoRA using a very small batch size (e.g., 1-2). Expect slower training times at this level.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;The recommended starting point for a good experience with 4-7B parameter models. This capacity allows for a more effective batch size (e.g., 8-16) when using LoRA, providing a great balance of training speed and cost.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40+ GB&lt;/td&gt;
&lt;td&gt;Necessary for maximizing training speed with large batch sizes or for working with larger models (in the 20B+ parameter range) now or in the future.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Encountering the CUDA out of memory error provides an important lesson in the trade-offs between model size, training techniques, and batch size. By understanding what consumes your HBM, you can make smarter decisions and keep your projects running smoothly.&lt;/p&gt;

&lt;p&gt;I hope that this guide has demystified the CUDA out of memory error. When you're ready to take the next step, Google Cloud has the tools to accelerate your AI development.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore &lt;a href="https://cloud.google.com/run/docs/configuring/services/gpu?utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GPU configurations for your Cloud Run services&lt;/a&gt; and best practices for running &lt;a href="https://cloud.google.com/run/docs/configuring/jobs/gpu-best-practices?hl=en&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run jobs with GPU&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For maximum control: Spin up a &lt;a href="https://cloud.google.com/products/compute" rel="noopener noreferrer"&gt;Compute Engine&lt;/a&gt; instance with the latest NVIDIA H100 or A100 Tensor Core GPUs and take full control of your environment.&lt;/li&gt;
&lt;li&gt;Looking to optimize your model hosting infrastructure? Take a look at &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/vllm-performance-tuning-the-ultimate-guide-to-xpu-inference-configuration?utm_campaign=CDR_0x91b1edb5_default_b451009911&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;The Ultimate Guide to xPU Inference Configuration&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For a deeper dive into scaling your model, check out &lt;a href="https://jax-ml.github.io/scaling-book" rel="noopener noreferrer"&gt;How to Scale Your Model&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;New to Google Cloud? Get started with the $300 free credit to find the perfect solution for your next project.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Special thanks to Jason Monden and Sayce Falk from the AI compute team for their helpful review and feedback on this post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gpu</category>
      <category>performance</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Agent Factory Recap: Can you do my shopping?</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Fri, 19 Dec 2025 19:44:58 +0000</pubDate>
      <link>https://dev.to/googleai/agent-factory-recap-can-you-do-my-shopping-5f8k</link>
      <guid>https://dev.to/googleai/agent-factory-recap-can-you-do-my-shopping-5f8k</guid>
      <description>&lt;p&gt;In episode #8 of &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs" rel="noopener noreferrer"&gt;The Agent Factory&lt;/a&gt;, Ivan Nardini and I are joined by Prateek Dudeja, product manager from the Agent Payment Protocol Team, to dive into one of the biggest hurdles for &lt;a href="https://cloud.google.com/discover/what-are-ai-agents?e=48754805&amp;amp;hl=en&amp;amp;utm_campaign=CDR_0x6e136736_awareness_b446653415&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt; in eccomerce: trust, especially when it comes to money.&lt;/p&gt;

&lt;p&gt;This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Agent Payment Protocol
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=103s" rel="noopener noreferrer"&gt;01:43&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What if an agent could buy concert tickets for you the moment they go on sale? You don't want to miss out! Maybe you want two tickets, and you don't want to spend more than $200. You definitely want to sit in a section with a great view of the stage. To have an agent act as your ticket buyer, you would have to trust that agent with all facets of your request and your credit card. How can you be sure that the agent won't buy 200 tickets or that it won't charge you for a lifetime supply of rubber duckies?&lt;/p&gt;

&lt;p&gt;The potential for a messy outcome with this concert ticket request provides insight into a "&lt;strong&gt;Crisis of Trust&lt;/strong&gt;" that can hold back agentic commerce. The good news is there's a way to move forward and build trust. &lt;/p&gt;

&lt;p&gt;To solve the "Crisis of Trust," Google introduced the &lt;a href="https://github.com/google-agentic-commerce/AP2" rel="noopener noreferrer"&gt;Agent Payment Protocol (AP2)&lt;/a&gt;, a new open standard. It's not a new payment system; it’s a "&lt;strong&gt;trust layer&lt;/strong&gt;" that sits on top of existing infrastructure. AP2 is designed to create a common, secure language for agents to conduct commerce, using role-based architecture and verifiable credentials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i6f0zm7dqgtryjkgpf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2i6f0zm7dqgtryjkgpf3.png" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Payments and the Current Payment System
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=149s" rel="noopener noreferrer"&gt;02:29&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The current payment system was built for humans using trusted interfaces like browsers, not for autonomous agents, resulting in three main challenges for agents: &lt;strong&gt;authorization, agent error, and accountability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvok8osf319hbwwru4p8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvok8osf319hbwwru4p8r.png" width="800" height="566"&gt;&lt;/a&gt;&lt;br&gt;
The &lt;strong&gt;Agent Payment Protocol&lt;/strong&gt; addresses these challenges by helping agents communicate securely with merchants and payment partners. The Agent Payment Protocol is available today as an extension for the &lt;a href="https://a2a-protocol.org/" rel="noopener noreferrer"&gt;A2A (Agent2Agent) protocol&lt;/a&gt; and relies on agents using the &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive into the Agent Payment Protocol
&lt;/h2&gt;

&lt;p&gt;Learn more about how this protocol works, including concepts and flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Role-Based Ecosystem
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=273s" rel="noopener noreferrer"&gt;04:33&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The protocol is built on a "separation of concerns." Your agent doesn't have to do everything. There are specialized roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shopping Agent&lt;/strong&gt;: The AI agent you build, great at finding products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merchant Endpoint&lt;/strong&gt;: The seller's API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Provider&lt;/strong&gt;: A secure digital wallet (like PayPal, Google Pay, etc.) that manages payment details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merchant Payment Processor&lt;/strong&gt;: The entity that constructs the final authorization message for the payment networks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0snpazgnllpzauxu1di0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0snpazgnllpzauxu1di0.png" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: &lt;em&gt;Your shopping agent never touches the raw credit card number. It doesn't need to be PCI compliant because it delegates the payment to the specialized, secure providers.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifiable Credentials (VCs)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=375s" rel="noopener noreferrer"&gt;06:15&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The "handshakes" between these roles in the Agent Payment Protocol ecosystem are secured by Verifiable Credentials (VCs). Think of credentials as protocolized, cryptographically signed digital receipts that prove what was agreed upon.&lt;/p&gt;

&lt;p&gt;There are three types of verifiable credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cart Mandate&lt;/strong&gt;: For "human-present" scenarios. The user reviews a final cart and cryptographically signs it as proof of approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent Mandate&lt;/strong&gt;: For "human-not-present" scenarios (like the concert ticket example). The user signs an intent (e.g., "buy tickets under $200"), giving the agent authority to act within those guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Mandate&lt;/strong&gt;: Provides clear visibility to payment networks and banks that an AI agent was involved in the transaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21rb8f2ntkuaeo3nhe0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21rb8f2ntkuaeo3nhe0o.png" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Contractual Conversational Model
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=483s" rel="noopener noreferrer"&gt;08:03&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Agent Payment Protocol process creates a "Contractual Conversational Model," moving beyond simple API calls to a flow built on verifiable proof.&lt;/p&gt;

&lt;p&gt;To understand this flow, we'll walk through a &lt;strong&gt;human-present scenario&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Delegation&lt;/strong&gt;: You tell your agent, "Buy two concert tickets."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery &amp;amp; Negotiation&lt;/strong&gt;: The agent contacts the merchant's endpoint to prepare the cart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finalize Cart&lt;/strong&gt;: The agent reaches out to your Credential Provider (e.g., your digital wallet). You select the payment method. The agent only gets a reference (like the last 4 digits), never the full credential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization with Mandates&lt;/strong&gt;: The agent shows you the finalized cart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You cryptographically sign the Cart Mandate&lt;/strong&gt;. This is the non-repudiable proof, the "contract."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purchase&lt;/strong&gt;: The agent sends this signed mandate to the merchant. The merchant can now trust the purchase mandate is from you. The merchant's payment processor uses the mandate to securely get the payment token from the credential provider and complete the transaction.&lt;/li&gt;
&lt;/ol&gt;
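
&lt;p&gt;To make the "sign, then verify" step concrete, here is an illustrative toy in Python. Real AP2 mandates are verifiable credentials signed with the user's own key pair; the HMAC below is only a stand-in for that signature, and every name in this sketch is hypothetical:&lt;/p&gt;

```python
import hashlib
import hmac
import json

# Illustrative only: an HMAC over the canonicalized cart stands in for the
# user's cryptographic signature, so the merchant can detect any tampering.
USER_KEY = b"user-device-secret"   # hypothetical key held by the user's device

def sign_cart_mandate(cart):
    payload = json.dumps(cart, sort_keys=True).encode()
    signature = hmac.new(USER_KEY, payload, hashlib.sha256).hexdigest()
    return {"cart": cart, "signature": signature}

def merchant_verifies(mandate):
    payload = json.dumps(mandate["cart"], sort_keys=True).encode()
    expected = hmac.new(USER_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, mandate["signature"])

cart = {"item": "concert ticket", "qty": 2, "max_total_usd": 200,
        "payment_ref": "visa **1234"}   # a reference only, never the full PAN
mandate = sign_cart_mandate(cart)
print(merchant_verifies(mandate))       # the cart is exactly as approved

mandate["cart"]["qty"] = 200            # a tampered cart...
print(merchant_verifies(mandate))       # ...fails verification
```

&lt;p&gt;The point of the sketch is the non-repudiation property: the signature binds the user's approval to one exact cart, so neither an over-eager agent nor a merchant can change it after the fact.&lt;/p&gt;
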

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd4ve1quoy9x6b3zraod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd4ve1quoy9x6b3zraod.png" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This flow all hinges on trust. In the short term, this trust is built using &lt;strong&gt;manual allow lists&lt;/strong&gt; of approved agents and merchants. In the long term, the plan is to use open web standards like HTTPS and DNS ownership to verify identities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q&amp;amp;A with Prateek Dudeja
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=787s" rel="noopener noreferrer"&gt;13:07&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the concepts explained, the discussion moved to a Q&amp;amp;A with Prateek.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a New Protocol for Payments?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=810s" rel="noopener noreferrer"&gt;13:30&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Prateek gave a great analogy: HTTPS is a baseline protocol for browsing. Signing in requires stronger authentication. Making a &lt;strong&gt;payment&lt;/strong&gt; requires an even higher level of trust. AP2 provides that "payments-grade security" on top of baseline protocols like A2A and MCP, ensuring the transaction is high-trust and truly from a human.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Will Agents Find Trusted Partners?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=882s" rel="noopener noreferrer"&gt;14:42&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the short term, agents will use "decentralized registries of trust" (or allow lists) to find merchants they can interact with. Prateek noted that all the roles (merchant, credential provider, etc.) already exist in the payments industry today. The only new role is the &lt;strong&gt;Shopping Agent&lt;/strong&gt; itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accountability: What Happens When Things Go Wrong?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=963s" rel="noopener noreferrer"&gt;16:03&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the big question. What if your agent shows you &lt;em&gt;blue&lt;/em&gt; shoes, you wanted &lt;em&gt;teal&lt;/em&gt;, but you click "approve" anyway?&lt;/p&gt;

&lt;p&gt;Prateek explained that the signed &lt;strong&gt;Cart Mandate&lt;/strong&gt; solves this. Because you biometrically signed a tamper-proof credential showing the &lt;em&gt;blue&lt;/em&gt; shoes, the responsibility is on you. The merchant has cryptographic evidence that you saw and approved the exact product. This protects merchants from fraudulent chargebacks and users from unauthorized agent actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo: Reference Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1084s" rel="noopener noreferrer"&gt;18:04&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Prateek walked through a demo showing the human-present flow. It showed the user prompting the agent, the agent discovering products, and then the &lt;strong&gt;Credential Provider (PayPal)&lt;/strong&gt; getting involved. The user selected their shipping and payment info &lt;em&gt;from PayPal&lt;/em&gt;, and the agent only saw a reference. The user then signed the Cart Mandate, and the purchase was completed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compatibility and Getting Started
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1183s" rel="noopener noreferrer"&gt;19:43&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A key question was: is this compatible with frameworks like LangGraph or CrewAI? &lt;strong&gt;Yes&lt;/strong&gt;. Prateek confirmed the protocol is compatible with any framework. As long as your agent can communicate over A2A or MCP, you can use AP2.&lt;/p&gt;

&lt;p&gt;To get started, Prateek directed developers to the &lt;a href="https://github.com/google-agentic-commerce/AP2" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. The first step is to see which role you want to play (merchant, credentials provider, etc.) and explore the sample code for that role.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Future: Dynamic Negotiation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Timestamp: [&lt;a href="https://www.youtube.com/watch?v=T1MtWnEYXM0&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1273s" rel="noopener noreferrer"&gt;21:13&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Looking ahead, Prateek shared an exciting vision for "dynamic negotiation." Imagine telling your agent: "I want that red dress that's out of stock. I need it by tomorrow... and I'm willing to pay 30% more".&lt;/p&gt;

&lt;p&gt;A merchant's agent could see this "intent" and, if the dress becomes available, automatically complete the sale. What was a lost sale for the merchant becomes a completed order at a markup, and the user gets the exact item they desperately wanted. &lt;/p&gt;

&lt;h2&gt;
  
  
  Your turn to build
&lt;/h2&gt;

&lt;p&gt;This conversation made it clear that building a secure payment infrastructure is a foundational step toward creating agents that can perform truly useful tasks in the real world. We're moving from a simple, programmatic web to a conversational, contractual one, and this protocol provides the framework for it.&lt;/p&gt;

&lt;p&gt;We encourage you to check out the &lt;a href="https://github.com/google-agentic-commerce/AP2" rel="noopener noreferrer"&gt;Agent Payment Protocol GitHub repo&lt;/a&gt;, think about which role you could play in this new ecosystem, and start building today!&lt;/p&gt;

&lt;h4&gt;
  
  
  Connect with us
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Shir Meir Lador → &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ivan Nardini → &lt;a href="https://www.linkedin.com/in/ivan-nardini/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/ivnardini" rel="noopener noreferrer"&gt;X&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Prateek Dudeja → &lt;a href="https://www.linkedin.com/in/prateek-dudeja/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>security</category>
      <category>ai</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>The Agent Factory podcast: 5 Episodes to Kickstart Your Journey to Production AI</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Tue, 25 Nov 2025 21:22:16 +0000</pubDate>
      <link>https://dev.to/googleai/the-agent-factory-podcast-5-episodes-to-kickstart-your-journey-to-production-ai-35ml</link>
      <guid>https://dev.to/googleai/the-agent-factory-podcast-5-episodes-to-kickstart-your-journey-to-production-ai-35ml</guid>
      <description>&lt;p&gt;We are so proud to announce that a project we're incredibly passionate about has grown into a full-blown resource for developers: The Agent Factory video podcast.&lt;/p&gt;

&lt;p&gt;We started this show with a simple mission: to have the conversations developers need to be having about AI agent development. We wanted to move past the hype and focus on what really matters—building production-ready AI agents.&lt;/p&gt;

&lt;p&gt;Fast forward to today, and we have &lt;a href="https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs" rel="noopener noreferrer"&gt;14 episodes&lt;/a&gt; published, covering everything from architecture patterns to end-to-end vibe coding of advanced AI applications. To celebrate, we’re sharing our first 5 foundational episodes with the Dev.to community. If you are just starting to build agents or looking to harden your existing systems, this is the perfect place to start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Expect:&lt;/strong&gt;&lt;br&gt;
We pack every episode with three core segments designed for developers:&lt;/p&gt;

&lt;p&gt;🎙️ &lt;strong&gt;Agent Industry Pulse:&lt;/strong&gt; We filter the noise and bring you the latest news you actually need to know.&lt;/p&gt;

&lt;p&gt;🛠️ &lt;strong&gt;The Factory Floor:&lt;/strong&gt; A technical deep-dive where we get our hands dirty with code, architectures, and patterns.&lt;/p&gt;

&lt;p&gt;❓ &lt;strong&gt;Developer Q&amp;amp;A:&lt;/strong&gt; We answer real questions from the community so we can all learn together.&lt;/p&gt;

&lt;p&gt;📺 &lt;strong&gt;The Starter Pack: Our First 5 Episodes&lt;/strong&gt;&lt;br&gt;
Here is the chronological journey to get you up to speed, starting from the very beginning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agents, their frameworks and when to use them (ft. Julia Wiesinger)&lt;/strong&gt; &lt;br&gt;
We kicked things off by tackling the big questions: What exactly is an agent? How do you choose between frameworks like LangChain, CrewAI, or the Agent Development Kit (ADK)? We were joined by Julia Wiesinger from the ADK team to guide us through building for production. 

  &lt;iframe src="https://www.youtube.com/embed/aLYrV61rJG4"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-Agent Systems: Concepts &amp;amp; Patterns&lt;/strong&gt;&lt;br&gt;
Single agent or multi-agent? In this episode, we break down the architectural patterns that matter, from Supervisors to Swarms. We discuss exactly when you should transition from a single agent to a team of agents to handle complexity and improve reliability. 

  &lt;iframe src="https://www.youtube.com/embed/TGNScswE0kU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Building Custom Tools for Agents&lt;/strong&gt;&lt;br&gt;
Agents are only as good as the tools they can use. We dive into Model Context Protocol (MCP), function calling, and how to build secure, authenticated tools that let your agents interact with the real world safely. 

  &lt;iframe src="https://www.youtube.com/embed/NiLb5DK4_rU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Memory in Agents (ft. Kimberly Milam)&lt;/strong&gt;&lt;br&gt;
How do you stop your agent from acting like a goldfish? We chat with Kimberly Milam about implementing long-term memory, managing state, and the "Memory Bank" concept to create personalized experiences that persist across sessions. 

  &lt;iframe src="https://www.youtube.com/embed/2yW7aTfjo88"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Tackling the Hardest Questions (ft. Philipp Schmidt)&lt;/strong&gt;&lt;br&gt;
We sat down with Philipp Schmidt from Google DeepMind for a masterclass on the agent development workflow. We cover context engineering, evaluation strategies, and pro-tips for using the Gemini CLI to speed up your development cycle. 

  &lt;iframe src="https://www.youtube.com/embed/kPVZQ3ae7-8"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💬 Join the Conversation&lt;/strong&gt;&lt;br&gt;
We’re truly excited to continue building this community with you. Whether you're stuck on a specific bug or wondering about a new architecture, we want to hear from you.&lt;/p&gt;

&lt;p&gt;What are you struggling with right now? Drop your questions in the comments below with &lt;strong&gt;#TheAgentFactory&lt;/strong&gt;, and we might answer them in our next Q&amp;amp;A segment!&lt;/p&gt;

&lt;p&gt;➡️ &lt;strong&gt;Listen &amp;amp; Subscribe: &lt;a href="https://www.youtube.com/googlecloudplatform" rel="noopener noreferrer"&gt;Google Cloud Tech&lt;/a&gt; on YouTube&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
