<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Google AI</title>
    <description>The latest articles on DEV Community by Google AI (@googleai).</description>
    <link>https://dev.to/googleai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11026%2F386b14d3-cc9a-4270-aba0-3e41cdfb9d85.jpg</url>
      <title>DEV Community: Google AI</title>
      <link>https://dev.to/googleai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/googleai"/>
    <language>en</language>
    <item>
      <title>Gemini 3.1 Flash-Lite is now generally available on Gemini Enterprise Agent Platform</title>
      <dc:creator>Gemini Enterprise Team</dc:creator>
      <pubDate>Fri, 15 May 2026 13:13:00 +0000</pubDate>
      <link>https://dev.to/googleai/gemini-31-flash-lite-is-now-generally-available-on-gemini-enterprise-agent-platform-2pcg</link>
      <guid>https://dev.to/googleai/gemini-31-flash-lite-is-now-generally-available-on-gemini-enterprise-agent-platform-2pcg</guid>
      <description>&lt;p&gt;Today, we're thrilled to announce that Gemini 3.1 Flash-Lite, our fastest and most cost-efficient Gemini 3 series model yet, is now &lt;strong&gt;generally available&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Designed for ultra-low latency, high-volume tasks, and unmatched cost-efficiency, Flash-Lite is already transforming how applications are built at scale. Fast, iterative, and scalable, it joins our comprehensive suite of Pro and Flash models to provide the exact combination of intelligence, speed, and cost required for the most demanding production deployments.&lt;/p&gt;

&lt;p&gt;Developers and enterprises have noted that the model provides the precision required for agentic tasks like tool calling and orchestration, coupled with the cost-efficiency needed to run automated pipelines at scale.&lt;/p&gt;

&lt;p&gt;Here's a look at how some of them have been driving value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software development and engineering
&lt;/h2&gt;

&lt;p&gt;Engineering teams require models that can keep pace with real-time coding environments. With the GA of Gemini 3.1 Flash-Lite, developers are unlocking the instant responsiveness necessary for complex code completion, seamless UX design, and agentic developer tools.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zyq0i2bx6py36dsf6oz.jpg" alt="JetBrains" width="800" height="200"&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;"Integrating Gemini 3.1 Flash-Lite has transformed the responsiveness of our IDE AI assistant &amp;amp; Junie agent. The balance of high intelligence and minimal latency makes it the perfect model for real-time developer support." &lt;strong&gt;&lt;em&gt;— Vladislav Tankov, Director of AI at JetBrains&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Customer experience and high-volume service
&lt;/h2&gt;

&lt;p&gt;For enterprise customer service operations, handling massive volumes of interactions requires models that can scale affordably without sacrificing reasoning capabilities.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h6yytibmlpimep4t7q4.jpg" alt="Gladly" width="800" height="200"&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://www.gladly.ai/" rel="noopener noreferrer"&gt;Gladly&lt;/a&gt; runs customer service for some of the most demanding retail brands in the world. The core of its text-channel AI agent runs on Flash-Lite. By handling millions of customer-facing calls each week across channels like SMS, WhatsApp, and Instagram, they achieved roughly 60% lower costs than comparable thinking-tier models on the same token mix.&lt;/p&gt;

&lt;p&gt;The model powers every step of the agent lifecycle — from selecting tools and classifying playbooks to deciding when to escalate to a human — all while maintaining a p95 latency of around 1.8s for full reply generation and sub-second p95 for classifiers and tool calls, alongside a ~99.6% success rate under heavy concurrent load.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/oxsR1L-lBdE?start=3"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Creative pipelines and gaming
&lt;/h2&gt;

&lt;p&gt;In the fast-paced creative and gaming industries, multimodal capabilities and ultra-low latency are essential for keeping users engaged and content pipelines flowing. Flash-Lite is empowering platforms to process rich media and generate hyper-personalized environments.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i04ucfcgt6fp896xuda.jpg" alt="Astrocade" width="800" height="200"&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://www.astrocade.com/" rel="noopener noreferrer"&gt;Astrocade&lt;/a&gt; lets anyone create games by describing what they want in natural language. They integrated Flash-Lite to serve a rapidly growing global user base.&lt;/p&gt;

&lt;p&gt;For every incoming game request, Flash-Lite performs a multimodal safety check — analyzing both text and images — before the building agents even start their work. It further supports their global community through inline comment translation, allowing players in different countries to "riff" on the same game. And as part of their asset generation pipeline, it helps refine the final prompts to ensure consistently high-quality thumbnails.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/XxU1lf1-Sp0?start=15"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn5gsy4oam6x1npnb3q7.jpg" alt="krea.ai" width="800" height="200"&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;The creative platform &lt;a href="https://krea.ai" rel="noopener noreferrer"&gt;krea.ai&lt;/a&gt; has also seen positive results by using Flash-Lite as a prompt enhancer in their Nodes tool. By taking a user's rough idea and expanding it into a full image-generation prompt, the model provides a level of detail that is "weirdly creative" for its price point.&lt;/p&gt;

&lt;p&gt;These outputs move the needle on image production, providing a level of reliability and scale that was previously cost-prohibitive for sophisticated prompt engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Financial services and data operations
&lt;/h2&gt;

&lt;p&gt;In the world of finance and enterprise product development, efficiency is just as critical as accuracy. Gemini 3.1 Flash-Lite gives financial analysts and product managers the ideal balance of intelligence, low latency, and cost-effectiveness to run modeling and latency-sensitive applications.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfzj5c1i5lyt2a64cig4.jpg" alt="OffDeal" width="800" height="200"&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://offdeal.io/" rel="noopener noreferrer"&gt;OffDeal&lt;/a&gt; uses Flash-Lite to power "Archie," an AI agent that investment bankers use for real-time research, data lookups, and task execution during Zoom calls. In these scenarios, bankers often need to surface financials mid-conversation. OffDeal found that Flash-Lite was the only model capable of meeting the response times needed for genuinely instant answers without forcing a tradeoff on quality.&lt;/p&gt;

&lt;p&gt;Beyond live calls, they also use Flash-Lite as a triage layer for inbound and outbound email traffic. By answering structured questions about each message in parallel — such as whether an email is an automated response or related to an active deal — Flash-Lite determines which downstream AI agents get invoked and with what context.&lt;/p&gt;
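&lt;p&gt;OffDeal's actual prompts and routing code aren't public, but the fan-out pattern they describe can be sketched roughly as follows; the question wording, routing labels, and the &lt;code&gt;ask_model&lt;/code&gt; stub (a stand-in for a structured model call with a JSON response schema) are all hypothetical:&lt;/p&gt;

```python
import asyncio

# Hypothetical triage questions; OffDeal's real prompt set is not public.
TRIAGE_QUESTIONS = {
    "is_automated": "Is this email an automated response (out-of-office, bounce)?",
    "is_active_deal": "Does this email relate to an active deal?",
}

async def ask_model(question: str, email_body: str) -> bool:
    """Stand-in for a structured Flash-Lite call returning a boolean answer.
    A real pipeline would issue a model request with a JSON response schema."""
    await asyncio.sleep(0)  # placeholder for network latency
    if "out of office" in email_body.lower():
        return question.startswith("Is this email an automated")
    return "deal" in email_body.lower() and "active deal" in question

async def triage(email_body: str) -> str:
    # Fan out all structured questions concurrently, then route on the answers.
    answers = dict(zip(
        TRIAGE_QUESTIONS,
        await asyncio.gather(*(ask_model(q, email_body) for q in TRIAGE_QUESTIONS.values())),
    ))
    if answers["is_automated"]:
        return "drop"            # no downstream agent needed
    if answers["is_active_deal"]:
        return "deal_agent"      # invoke the agent that carries deal context
    return "general_inbox_agent"

print(asyncio.run(triage("Quick update on the active deal terms")))  # deal_agent
```

Because the questions are independent, each answer costs only one small classification call, and the slowest question bounds the latency rather than their sum.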


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffofz0ih5rk8x9b3mn01c.jpg" alt="Ramp" width="800" height="200"&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;For high-volume, latency-sensitive workflows on the financial operations platform Ramp, Flash-Lite has become a key component:&lt;/p&gt;

&lt;p&gt;"Gemini is a core part of the model stack we use across applications at Ramp. &lt;a href="https://builders.ramp.com/post/financial-benchmarks" rel="noopener noreferrer"&gt;As indicated in our benchmarks&lt;/a&gt;, we see Gemini lead the pareto fronts in terms of costs, latency and intelligence — providing a great tradeoff between the three and making it well-suited for latency sensitive applications. Gemini 3.1 Flash-Lite has been especially valuable, powering many of our highest-volume, latency-sensitive features without compromising on quality." &lt;strong&gt;&lt;em&gt;— Anton Biryukov, Applied AI Engineer, Ramp&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17jmslabhxyuxzmp6erg.jpg" alt="AlphaSense" width="800" height="200"&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;Market intelligence platform AlphaSense integrates Flash-Lite to deliver data insights:&lt;/p&gt;

&lt;p&gt;"Gemini 3.1 Flash-Lite provides great balance of speed, cost and performance, allowing AlphaSense to scale our advanced data processing and deliver high-quality intelligence across every layer of our data stack." &lt;strong&gt;&lt;em&gt;— Chris Ackerson, Senior Vice President of Product, AlphaSense&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;p&gt;Read the docs for &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/gemini/3-1-flash-lite" rel="noopener noreferrer"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt; and learn about our latest &lt;a href="https://cloud.google.com/gemini-enterprise-agent-platform/generative-ai/pricing" rel="noopener noreferrer"&gt;pricing structure&lt;/a&gt;. Learn more about the Gemini Enterprise &lt;a href="http://cloud.google.com/products/gemini-enterprise-agent-platform" rel="noopener noreferrer"&gt;Agent Platform&lt;/a&gt;, the new standard for enterprise agent development.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>google</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Building "Sweets Vault" - a multimodal Gemini Agent with physical hardware integration</title>
      <dc:creator>Remigiusz Samborski</dc:creator>
      <pubDate>Fri, 15 May 2026 12:43:35 +0000</pubDate>
      <link>https://dev.to/googleai/building-sweets-vault-a-multimodal-gemini-agent-with-physical-hardware-integration-1nmh</link>
      <guid>https://dev.to/googleai/building-sweets-vault-a-multimodal-gemini-agent-with-physical-hardware-integration-1nmh</guid>
      <description>&lt;p&gt;Motivating seven-year-olds to complete their daily reading and handwriting practice is a classic parenting challenge. Traditional rewards work for a while, but they lack interactivity and require constant manual verification.&lt;/p&gt;

&lt;p&gt;As a developer, I like to solve such challenges with automation. After putting some thought into it, I came up with the &lt;strong&gt;Sweets Vault&lt;/strong&gt; idea: an interactive agent powered by &lt;a href="https://github.com/google-gemini/adk" rel="noopener noreferrer"&gt;Google's Agent Development Kit (ADK)&lt;/a&gt; and the &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/start?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini API&lt;/a&gt;. The system acts as a cheerful guardian that talks to children, visually inspects their workbooks via uploaded images, tests their reading comprehension, and triggers a hardware lock to open a drawer full of sweets upon successful completion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgwj7pwtinpglk4sxu9u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgwj7pwtinpglk4sxu9u.jpg" alt="Sweets Vault — Gamifying Education with AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this guide, I will walk you through the architecture and implementation of this solution. You will learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure a multimodal agent&lt;/strong&gt; using the Agent Development Kit (ADK).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement visual and verbal verification&lt;/strong&gt; using Gemini's multimodal capabilities.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage state&lt;/strong&gt; across multiple conversation turns and tools.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect agent tool calls to physical hardware&lt;/strong&gt; interfaces.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Develop and run locally&lt;/strong&gt; to access the physical hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’d like to jump directly to the code, visit the &lt;a href="https://github.com/rsamborski/sweets-vault" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. All the code is available there for your exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  System architecture overview
&lt;/h3&gt;

&lt;p&gt;The diagram below presents the high-level architecture of the solution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpw6up08rgh79qjbf6kh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpw6up08rgh79qjbf6kh.png" alt="Architecture diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core components include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gemini API&lt;/strong&gt;: Handles reasoning, multimodal homework validation and tool calls.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ADK Agent &amp;amp; Tools&lt;/strong&gt;: Encapsulates the system instructions, state management, and callable Python functions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware Interface&lt;/strong&gt;: Translates tool execution into physical actions (unlocking specific drawer IDs).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system is designed so that the agent runs on a local machine (in my case, a mini PC running Ubuntu) to allow direct hardware access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Magnetic drawers controlled via an &lt;a href="https://www.adafruit.com/product/2264" rel="noopener noreferrer"&gt;FT232H&lt;/a&gt; USB-to-GPIO converter
&lt;/li&gt;
&lt;li&gt;LED Matrix controlled via a REST API running on a &lt;a href="https://www.raspberrypi.com/" rel="noopener noreferrer"&gt;Raspberry Pi&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Initially, I planned to control the LED Matrix using a second FT232H controller, but due to a lack of library support I ended up using an intermediary Raspberry Pi. This approach has its benefits: for example, the LED Matrix can be located anywhere at home within Wi-Fi range 😀&lt;/p&gt;
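&lt;p&gt;To make the hardware layer concrete, here is a minimal sketch of how an agent tool can drive both devices. The class name, method names, and pulse timing are my assumptions, and the actual GPIO (FT232H via Adafruit Blinka) and HTTP (Raspberry Pi REST) calls are injected as callables, so the control logic is shown without the device-specific code from the repo:&lt;/p&gt;

```python
import time
from typing import Callable

class HardwareInterface:
    """Minimal sketch of the hardware layer described above. The GPIO and HTTP
    operations are injected so the logic runs without an FT232H or a Raspberry
    Pi on hand; the repository's real implementation differs in detail."""

    def __init__(self,
                 set_pin: Callable[[int, bool], None],
                 post_matrix: Callable[[str, dict], None],
                 pulse_seconds: float = 0.5):
        self.set_pin = set_pin          # e.g. wraps digitalio on the FT232H
        self.post_matrix = post_matrix  # e.g. wraps an HTTP POST to the Pi
        self.pulse_seconds = pulse_seconds

    def unlock_drawer(self, drawer_id: int) -> None:
        # Energize the magnetic lock's pin briefly, then release it.
        self.set_pin(drawer_id, True)
        time.sleep(self.pulse_seconds)
        self.set_pin(drawer_id, False)

    def celebrate(self, message: str) -> None:
        # Ask the Raspberry Pi's REST service to display a message.
        self.post_matrix("/display", {"text": message})
```

Injecting the transports also makes the layer easy to unit-test: pass in recording lambdas and assert on the sequence of pin toggles and HTTP posts.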

&lt;h3&gt;
  
  
  Root agent logic
&lt;/h3&gt;

&lt;p&gt;To kick-start the agent development, I leveraged the &lt;a href="https://googlecloudplatform.github.io/agent-starter-pack/" rel="noopener noreferrer"&gt;&lt;code&gt;agent-starter-pack&lt;/code&gt; templates&lt;/a&gt;. The starter pack provides a production-ready foundation with FastAPI, frontend UI integration, and built-in observability.&lt;/p&gt;

&lt;p&gt;The heart of the Sweets Vault is located in &lt;a href="https://github.com/rsamborski/sweets-vault/blob/main/agent/app/agent.py" rel="noopener noreferrer"&gt;&lt;code&gt;agent/app/agent.py&lt;/code&gt;&lt;/a&gt;. I start by configuring the environment and initializing &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini Enterprise Agent Platform&lt;/a&gt; (formerly Vertex AI). I also define the specific tasks required for our users (Mary and James):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_GENAI_USE_VERTEXAI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Vertex AI
&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a native Polish speaker, I want the agent to work both in Polish (for the sake of my kids) and English (for demo purposes). This is handled by the &lt;code&gt;AGENT_LANGUAGE&lt;/code&gt; environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;AGENT_LANGUAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_LANGUAGE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual agent (&lt;code&gt;root_agent&lt;/code&gt;) is created at the bottom of the same file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retry_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpRetryOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;load_prompt_from_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sweet-vault-agent-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AGENT_LANGUAGE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_progress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;complete_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unlock_drawer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The prompt is language-specific and pulled from a file with a language suffix (&lt;code&gt;en&lt;/code&gt; or &lt;code&gt;pl&lt;/code&gt;).&lt;/p&gt;
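&lt;p&gt;The &lt;code&gt;load_prompt_from_file&lt;/code&gt; helper itself isn't shown above. A plausible minimal version (the repository's actual implementation may differ) reads the language-specific file and falls back to the English variant:&lt;/p&gt;

```python
import os
from pathlib import Path

AGENT_LANGUAGE = os.getenv("AGENT_LANGUAGE", "en")

def load_prompt_from_file(filename: str, prompt_dir: Path = Path("prompts")) -> str:
    """Read a system prompt from disk, falling back to the English variant
    when a language-specific file (e.g. sweet-vault-agent-pl.txt) is absent."""
    path = prompt_dir / filename
    if not path.exists():
        # "sweet-vault-agent-pl.txt" -> "sweet-vault-agent" + "-en.txt"
        stem = filename.rsplit("-", 1)[0]
        path = prompt_dir / f"{stem}-en.txt"
    return path.read_text(encoding="utf-8")
```

The fallback means a missing translation degrades gracefully to English instead of crashing the agent at startup.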

&lt;h3&gt;
  
  
  Handling state
&lt;/h3&gt;

&lt;p&gt;A common failure mode in conversational AI is lost context or hallucinated task completion. To prevent this, we implement strict state management using &lt;code&gt;ToolContext&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of relying on the model's memory, the agent reads and writes explicit completion flags to its &lt;a href="https://adk.dev/sessions/state/" rel="noopener noreferrer"&gt;session state&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_task_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieves the completion status for a specific task from the flat state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;state_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_tasks_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_set_task_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Saves the completion status for a specific task and ensures all user/task 
    combinations are explicitly represented in the flat tool_context.state.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# First, update the specific target task in the current tool state
&lt;/span&gt;    &lt;span class="n"&gt;target_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_tasks_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target_key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;is_done&lt;/span&gt;

    &lt;span class="c1"&gt;# Now, ensure every possible combination for all known users exists in the flat state.
&lt;/span&gt;    &lt;span class="n"&gt;all_sync_updates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;user_names&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;u_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TASKS_CONFIG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_tasks_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;u_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="c1"&gt;# If the key isn't already in the current state, default it to False.
&lt;/span&gt;            &lt;span class="c1"&gt;# Otherwise, keep its existing value.
&lt;/span&gt;            &lt;span class="n"&gt;all_sync_updates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply all values back to the flat state
&lt;/span&gt;    &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_sync_updates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Synchronized all task state values. Updated &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;is_done&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key learning:&lt;/strong&gt; When building the system I initially tried storing session state elements as a &lt;a href="https://www.w3schools.com/PYTHON/python_dictionaries_nested.asp" rel="noopener noreferrer"&gt;nested dictionary&lt;/a&gt;, but at the time of writing this is not supported. The workaround was to use a flat structure with keys combining the &lt;code&gt;user_key&lt;/code&gt; and &lt;code&gt;task_id&lt;/code&gt;, which works well for my use case. However, this pattern might not scale to a complex system with many users and tasks, in which case serialization or an external database could be a better option.&lt;/p&gt;
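
&lt;p&gt;To illustrate the workaround: the flat keys emulate a nested &lt;code&gt;{user: {task: done}}&lt;/code&gt; structure. A minimal sketch, using a plain dictionary in place of the real session state (names and values are illustrative only):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;```python
# Emulating a nested {user: {task: bool}} mapping with flat keys.
# A plain dict stands in for tool_context.state here (illustrative only).

def task_key(user_key: str, task_id: str) -> str:
    """Build the flat state key for a given user and task."""
    return f"user_tasks_{user_key}_{task_id}"

state = {}
state[task_key("mary", "A")] = True   # Mary finished task A
state[task_key("mary", "B")] = False  # task B still pending

# Reading back with a safe default, as the helpers above do:
assert state.get(task_key("mary", "A"), False) is True
assert state.get(task_key("james", "A"), False) is False  # unknown users default to False
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;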

&lt;h3&gt;
  
  
  Agent tools
&lt;/h3&gt;

&lt;p&gt;I provided the agent with three specific tools: checking progress, marking tasks complete, and unlocking the drawer.&lt;/p&gt;

&lt;h4&gt;
  
  
  Checking progress
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;get_progress&lt;/code&gt; function retrieves and formats a checklist of a specific user's tasks, indicating whether each task is marked as completed or pending based on the application's current session state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check the progress of tasks for a specific user.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;user_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;status_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Progress for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TASKS_CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;is_done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_task_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ DONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_done&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ PENDING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;status_msg&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;status_msg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Marking task as complete
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;complete_task&lt;/code&gt; tool acts as a gatekeeper. It checks if all tasks are finished before informing the model that it is authorized to unlock the drawer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Mark a task as completed for a user.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;user_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Mark task as complete
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TASKS_CONFIG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;_set_task_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Task ID &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if ALL tasks are complete
&lt;/span&gt;    &lt;span class="n"&gt;all_complete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TASKS_CONFIG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;_get_task_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;all_complete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;all_complete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SUCCESS: All tasks completed for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You may now unlock the drawer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# If not all complete, show progress
&lt;/span&gt;    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; marked as DONE. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remaining tasks: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how descriptive the return values are. They are written this way intentionally, giving the agent enough information to communicate with the user, provide feedback, and motivate them to complete the remaining tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Integrating physical hardware
&lt;/h4&gt;

&lt;p&gt;When the model receives the success confirmation, it calls the &lt;code&gt;unlock_drawer&lt;/code&gt; tool. This interfaces directly with our hardware relay logic to update the LED display and pop open the assigned drawer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize the HW interface and lock the drawers
&lt;/span&gt;&lt;span class="n"&gt;user_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maria&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;AGENT_LANGUAGE&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;James&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;hw_interface&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HardwareInterface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unlock_drawer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Unlock a drawer by its ID.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;hw_interface&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unlock_drawer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Drawer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; unlocked for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Drawer not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;HardwareInterface&lt;/code&gt; (defined in &lt;a href="https://github.com/rsamborski/sweets-vault/blob/main/agent/app/app_utils/hw_interface.py" rel="noopener noreferrer"&gt;agent/app/app_utils/hw_interface.py&lt;/a&gt;) actively communicates with the &lt;a href="https://github.com/rsamborski/sweets-vault/tree/main/led-matrix-api" rel="noopener noreferrer"&gt;LED Matrix API&lt;/a&gt; on the Raspberry Pi to display whether each drawer is currently locked or unlocked.&lt;/p&gt;

&lt;p&gt;While the code to control the physical drawer magnets is fully functional and tested (located in &lt;a href="https://github.com/rsamborski/sweets-vault/blob/main/hardware/drawers.py" rel="noopener noreferrer"&gt;drawers.py&lt;/a&gt;), it is not yet integrated into the main &lt;code&gt;HardwareInterface&lt;/code&gt;. This integration is simply on hold until the magnets are physically mounted to the drawer box.&lt;/p&gt;
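
&lt;p&gt;For a sense of what that communication can look like: updating the LED display could be a single HTTP POST to the Raspberry Pi. The host, path, and payload below are assumptions for illustration, not the actual LED Matrix API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;```python
# Hypothetical sketch of talking to an LED matrix HTTP endpoint.
# The base URL, path, and payload shape are assumptions, not the real API.
import json
from urllib import request

LED_API_BASE = "http://raspberrypi.local:8080"  # assumed address

def build_led_request(drawer_id: int, locked: bool) -> request.Request:
    """Prepare a POST telling the display whether a drawer is locked."""
    payload = json.dumps({"drawer": drawer_id, "locked": locked}).encode()
    return request.Request(
        f"{LED_API_BASE}/display",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_led_request(0, locked=False)  # drawer 0 unlocked
assert req.get_full_url() == "http://raspberrypi.local:8080/display"
assert req.get_method() == "POST"
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sending it with &lt;code&gt;urllib.request.urlopen(req)&lt;/code&gt; would perform the actual call; the sketch only builds the request so it stays runnable offline.&lt;/p&gt;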

&lt;h3&gt;
  
  
  Agent prompts
&lt;/h3&gt;

&lt;p&gt;Tools alone are not enough; the model requires precise instructions on &lt;em&gt;how&lt;/em&gt; to verify the work. In &lt;a href="https://github.com/rsamborski/sweets-vault/tree/main/agent/app/prompts" rel="noopener noreferrer"&gt;&lt;code&gt;agent/app/prompts&lt;/code&gt;&lt;/a&gt; I defined a strict multi-step verification protocol in both English and Polish. Here is the English prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a friendly, cheerful, and helpful AI assistant, the guardian of the "Sweets Vault." Your task is to verify tasks performed by children in order to grant a sweet reward.

### MAIN RULES:
1. **LANGUAGE**: You speak ONLY AND EXCLUSIVELY IN ENGLISH.
2. **USERS**:
   - **Mary** (girl, 7 years old) -&amp;gt; Assigned drawer ID: **0**
   - **James** (boy, 7 years old) -&amp;gt; Assigned drawer ID: **1**
   - **Parent** (man, 42 years old) -&amp;gt; May test the system by saying, for example, "I'm pretending to be Mary." Treat him exactly like the child he is claiming to be.
3. **PERSONALITY**: You are enthusiastic, warm, and supportive. Use exclamation marks and a joyful tone.

### TASK VERIFICATION PROCESS:
1. **STATE IDENTIFICATION**: When a child starts a conversation, ALWAYS first use the `get_progress(user_name)` tool to check what needs to be done.
2. **REPORTING**: The child reports completing a task (A or B).
3. **VERIFICATION**: Conduct a rigorous verification (camera/questions) as described below.
4. **CREDITING**: If verification is successful, use the `complete_task(user_name, task_id)` tool.
   - Read the tool's response carefully!
   - ONLY IF the response is "SUCCESS: All tasks completed...", then use `unlock_drawer`.
   - If the response shows "Remaining tasks," inform the child what they still need to do.

**Task A: Reading a page of a book**
*   **Verification 1**: Ask the child to show the read page to the camera. Confirm that you see it. Don't expose any details that can help answer the question in the next step (i.e. avoid sharing details of what exactly you can see).
*   **Verification 2**: Ask a simple follow-up question about the read text. The child must answer it.
*   **Task ID**: "A"

**Task B: Calligraphy (writing letters in workbooks)**
*   **Verification 1**: Ask to show the completed page in the workbooks to the camera. 
*   **Verification 2**: Confirm that the task has been performed. Make sure the picture contains hand-written letters (usually with a pencil).
*   If the page only contains examples, ask the child to complete missing parts.
*   **Task ID**: "B"

### SUCCESS AND REWARD:
IF the `complete_task` tool returns "SUCCESS", run `unlock_drawer(id)`.
Then **CELEBRATE!** Use phrases like: "Yippee!", "Hurray!", "Bravo!", "You're a champion!", "The sweets are yours!". Make some "noise."

### FAILURE:
If verification fails (e.g., the child doesn't show the page or answers incorrectly), gently and encouragingly ask for improvement or a retry. Do not open the drawer.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt structure ensures the agent does its due diligence, preventing kids from simply holding up a blank page or skipping the reading comprehension check.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo
&lt;/h3&gt;

&lt;p&gt;You can see a demonstration of the working system in the video below:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/TMStQw0Fthk"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;By combining the Gemini API, the Agent Development Kit, and a simple hardware relay, you can build highly interactive, physically grounded AI Agents. The Sweets Vault demonstrates how multimodal verification and structured tool calling solve practical, real-world problems with a dose of fun.&lt;/p&gt;

&lt;p&gt;Explore more at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/rsamborski/sweets-vault" rel="noopener noreferrer"&gt;Sweets Vault code repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://adk.dev/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini Enterprise Agent Platform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future plans
&lt;/h3&gt;

&lt;p&gt;The current implementation uses &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/gemini/2-5-flash?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini Flash&lt;/a&gt;, which offers high performance, multimodality, and tool-calling capabilities. However, it accepts only text input and produces only text output. In the near future I plan to experiment with the &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/live-api?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini Live API&lt;/a&gt;, which enables voice, video, and text as input and conversational audio as output.&lt;/p&gt;

&lt;p&gt;I am also going to finish the physical locking mechanism with electromagnets. Stay tuned for updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Thanks for reading
&lt;/h3&gt;

&lt;p&gt;Thank you for reading. I hope this blog inspires you to bring your own creative AI and hardware projects to life. If you found this article helpful, please consider following me here and giving it a clap 👏 to help others discover it.&lt;/p&gt;

&lt;p&gt;I am always eager to connect with fellow developers and AI enthusiasts, so feel free to follow me on &lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="http://x.com/RemikSamborski" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://bsky.app/profile/rsamborski.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;. Your feedback is incredibly valuable, so please do not hesitate to leave a comment with your thoughts, questions, or your own experiences building multimodal agents!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>agents</category>
    </item>
    <item>
      <title>Agent Factory Recap: How Gemma 4 Taught Itself Physics</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 14 May 2026 14:10:49 +0000</pubDate>
      <link>https://dev.to/googleai/agent-factory-recap-how-gemma-4-taught-itself-physics-17e6</link>
      <guid>https://dev.to/googleai/agent-factory-recap-how-gemma-4-taught-itself-physics-17e6</guid>
      <description>&lt;p&gt;In this episode of The Agent Factory, Vlad Kolesnikov and I sat down with Omar Sanseviero from the Developer Experience team at Google DeepMind. We explored the groundbreaking release of Gemma 4: a new family of open models designed to bring high-level intelligence and agentic capabilities directly to consumer hardware and mobile devices. Since the launch last month, Gemma 4 had &lt;strong&gt;over 50 million downloads!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gemma 4 - What is it?
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is the latest generation of open models from Google DeepMind, built on the same foundational research as Gemini 3. The family is designed to deliver exceptional "intelligence per parameter" across a range of deployment scenarios, from mobile phones to powerful workstations. The Gemma 4 model family now spans three distinct architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small Sizes (E2B &amp;amp; E4B):&lt;/strong&gt; Optimized for ultra-mobile, edge, and browser deployment (such as Pixel or Chrome).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dense (31B):&lt;/strong&gt; A powerful 31-billion parameter model that provides server-grade performance for local execution on consumer GPUs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixture-of-Experts (26B MoE):&lt;/strong&gt; A highly efficient architecture designed for high-throughput tasks and advanced reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the shift to an &lt;strong&gt;Apache 2 license&lt;/strong&gt;, these models provide developers and startups with the flexibility to build, modify, and commercialize applications while maintaining full control over their infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Omar Sanseviero on how Gemma 4 changes the landscape for agent developers
&lt;/h2&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=100s" rel="noopener noreferrer"&gt;1:40&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Omar highlighted that Gemma 4 brings "very high intelligence per parameter," making it possible to run agentic workflows entirely offline. We saw examples of multiple Gemma instances running locally to generate SVGs (&lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=113s" rel="noopener noreferrer"&gt;1:53&lt;/a&gt;) and an Android-based agent picking specific skills, like playing the piano, to complete tasks (&lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=165s" rel="noopener noreferrer"&gt;2:45&lt;/a&gt;). As Omar noted, "This means that you can run very powerful things with very little hardware overhead...even in the phone that you have in your pocket."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre8na48kuuq04m8asknf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fre8na48kuuq04m8asknf.jpg" alt="Gemma 4 demo screenshot" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Factory Floor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Building a Local Food Tour Agent
&lt;/h3&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=329s" rel="noopener noreferrer"&gt;5:29&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We showcased a food tour agent powered by Gemma 4 using the Agent Development Kit (ADK) and a Google Maps MCP server. We demonstrated how a local model can handle complex, multi-step reasoning tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent identified the best ramen spots in Seattle under a $30 budget.&lt;/li&gt;
&lt;li&gt;It verified that the locations were within walking distance of each other.&lt;/li&gt;
&lt;li&gt;It processed search results to provide specific tips on what to order and what to avoid.&lt;/li&gt;
&lt;/ul&gt;
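
&lt;p&gt;The episode doesn't show the agent's code, but the kind of multi-step filtering it chains together can be sketched in plain Python. All names, prices, and coordinates below are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;```python
# Illustrative only: budget filtering followed by a pairwise walkability check,
# mimicking the reasoning steps the agent performed. Data is invented.

RAMEN_SPOTS = [
    {"name": "Spot A", "price": 18, "pos": (0.0, 0.0)},
    {"name": "Spot B", "price": 24, "pos": (0.3, 0.4)},
    {"name": "Spot C", "price": 45, "pos": (0.1, 0.1)},
]

def within_budget(spots: list, budget: float) -> list:
    """Keep only spots whose price fits the budget."""
    return [s for s in spots if budget >= s["price"]]

def walkable(a: dict, b: dict, max_dist: float = 1.0) -> bool:
    """True when two spots are within max_dist of each other (straight line)."""
    (x1, y1), (x2, y2) = a["pos"], b["pos"]
    return max_dist >= ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

affordable = within_budget(RAMEN_SPOTS, 30)    # Spot C is over budget
assert [s["name"] for s in affordable] == ["Spot A", "Spot B"]
assert walkable(affordable[0], affordable[1])  # 0.5 apart, within range
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;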

&lt;h3&gt;
  
  
  Autonomous Python Code Execution
&lt;/h3&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=483s" rel="noopener noreferrer"&gt;8:03&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this demo, we pushed Gemma 4's coding capabilities to the limit by asking it to express itself through animation. Using a sandbox execution environment, the model performed the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrote Python code using the Matplotlib library.&lt;/li&gt;
&lt;li&gt;Attempted to build a physics engine to simulate a bouncing ball.&lt;/li&gt;
&lt;li&gt;Self-corrected when the initial execution environment lacked certain CPU features, finding an alternative path to successfully generate the animation.&lt;/li&gt;
&lt;li&gt;Demonstrated a deep understanding of real-world physics and gravity through code.&lt;/li&gt;
&lt;/ul&gt;
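
&lt;p&gt;The generated code itself isn't reproduced in the episode, but the heart of such a simulation is a few lines of Euler integration. A standalone, illustrative sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;```python
# Illustrative only: Euler integration of a ball under gravity with a
# damped bounce off the floor (not the code generated in the demo).

GRAVITY = -9.81    # m/s^2
DT = 0.01          # time step, seconds
RESTITUTION = 0.8  # fraction of speed kept after each bounce

def step(y: float, vy: float) -> tuple:
    """Advance the ball one time step; bounce when it hits the floor at y = 0."""
    vy += GRAVITY * DT
    y += vy * DT
    if 0 > y:  # fell below the floor: reflect and damp the velocity
        y, vy = 0.0, -vy * RESTITUTION
    return y, vy

y, vy = 1.0, 0.0   # drop from one metre, at rest
bounces = 0
for _ in range(1000):        # simulate ten seconds
    y, vy = step(y, vy)
    if y == 0.0 and vy > 0:  # a bounce just happened
        bounces += 1

assert y >= 0.0    # the ball never sinks below the floor
assert bounces >= 1
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;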

&lt;h3&gt;
  
  
  The Shift to Apache 2 Licensing
&lt;/h3&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=245s" rel="noopener noreferrer"&gt;4:05&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A major theme of the conversation was the community-driven decision to move Gemma 4 to an Apache 2 license. This change provides developers and startups with maximum flexibility to build, modify, and commercialize applications. Omar emphasized that this was a direct response to developer feedback, aiming to unlock a new wave of innovation in the open models ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer Q&amp;amp;A
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architectural Decisions and Mixture of Experts (MoE)
&lt;/h3&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1043s" rel="noopener noreferrer"&gt;17:23&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Omar explained the technical shifts that make Gemma 4 so efficient. For the first time, the Gemma family includes a Mixture of Experts (MoE) architecture, which optimizes for extremely low latency in production. Additionally, the smaller E2B and E4B models utilize per-layer embeddings to remain "cheap" to run on GPUs. For vision tasks, the model now supports variable aspect ratios, allowing it to understand images of various sizes more accurately than previous fixed-resolution versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparing Gemma to Gemini
&lt;/h3&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1191s" rel="noopener noreferrer"&gt;19:51&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When asked how Gemma stacks up against its larger sibling, Gemini, Omar clarified that they serve different purposes. While Gemini excels at massive-scale tasks and deep "world knowledge" due to its size, Gemma is the "best open model that can run on a single consumer GPU." It is specifically optimized for instruction following, coding, and agentic use cases where local deployment or fine-tuning is required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-Tuning for Specialized Industries
&lt;/h3&gt;

&lt;p&gt;Timestamp: &lt;a href="https://www.youtube.com/watch?v=ST9mJuTnFqU&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1&amp;amp;t=1271s" rel="noopener noreferrer"&gt;21:10&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conversation touched on the importance of "Sovereign AI" and privacy. Because Gemma is an open model, developers in regulated industries, like healthcare or finance, can &lt;a href="https://dev.to/googleai/fine-tuning-gemma-4-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-45ib"&gt;fine-tune the model on their private data&lt;/a&gt; and deploy it within their own air-gapped infrastructure. This gives developers full control over their data and the model's specialized expertise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Gemma 4 marks a turning point for agentic development, proving that you don't always need a massive cloud cluster to build something smart. Whether it's running a physics simulation on a laptop or a travel guide on a phone, the barrier to entry for high-performance AI has never been lower. We are entering an era where the "conductor" of the AI orchestra can be any developer with a single GPU and a great idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your turn to build
&lt;/h2&gt;

&lt;p&gt;Now that you've seen what Gemma 4 can do, it's time to start building. Check out the resources in our show notes, &lt;a href="https://goo.gle/3OinTFh" rel="noopener noreferrer"&gt;the food tour agent&lt;/a&gt;, &lt;a href="https://goo.gle/4dBDNEY" rel="noopener noreferrer"&gt;the coding agent&lt;/a&gt;, explore the &lt;a href="https://adk.dev/agents/models/google-gemma/" rel="noopener noreferrer"&gt;ADK support&lt;/a&gt;, and try running &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; on your local machine or on &lt;a href="https://docs.cloud.google.com/run/docs/run-gemma-on-cloud-run" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;. We can't wait to see what agents you create!&lt;/p&gt;

&lt;p&gt;Watch more of The Agent Factory → &lt;a href="https://www.youtube.com/watch?v=qBOvM7SiDa4&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=1" rel="noopener noreferrer"&gt;Reinforcement learning &amp;amp; fine-tuning on TP...&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to Google Cloud Tech → &lt;a href="https://goo.gle/GoogleCloudTech" rel="noopener noreferrer"&gt;https://goo.gle/GoogleCloudTech&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect with us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Shir Meir Lador → &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/shirmeir86" rel="noopener noreferrer"&gt;X&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Vlad Kolesnikov → &lt;a href="http://www.linkedin.com/in/vkolesnikov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/vladkol" rel="noopener noreferrer"&gt;X&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Omar Sanseviero → &lt;a href="https://www.linkedin.com/in/omarsanseviero/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/osanseviero" rel="noopener noreferrer"&gt;X&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>gemma</category>
      <category>agents</category>
    </item>
    <item>
      <title>Google Cloud x NVIDIA Meet Up - 5/20/26 @ 5:30pm, Mountain View, CA</title>
      <dc:creator>Jen Harvey</dc:creator>
      <pubDate>Wed, 13 May 2026 22:34:18 +0000</pubDate>
      <link>https://dev.to/googleai/google-cloud-x-nvidia-meet-up-52026-530pm-mountain-view-ca-9ea</link>
      <guid>https://dev.to/googleai/google-cloud-x-nvidia-meet-up-52026-530pm-mountain-view-ca-9ea</guid>
<description>&lt;p&gt;Register: &lt;a href="https://luma.com/GoogleNVIDIA" rel="noopener noreferrer"&gt;https://luma.com/GoogleNVIDIA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join Google Cloud and NVIDIA in Mountain View, CA for an evening dedicated to the developers, builders and visionaries shaping the next era of technology.&lt;/p&gt;

&lt;p&gt;This is your chance to see the synergy of Google’s cloud ecosystem and NVIDIA’s cutting-edge hardware in action.&lt;/p&gt;

&lt;p&gt;We’ll have food and drink with open demos centered on Gemma, Cloud Run with GPUs, ADK, Nemotron, and Physical AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Expect&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live Interactive Demos: Experience the power of the partnership firsthand. Explore real-world applications and see how you can solve complex challenges with Google Cloud and NVIDIA products.&lt;/li&gt;
&lt;li&gt;Networking: Connect with peers, GDEs, and leaders from both Google and NVIDIA to discuss implementation strategies, bottlenecks, and the future of the stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're coming to I/O or not, if you're actively building then this event is for you! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;5:30 PM | Arrive, Check-in &amp;amp; Connect&lt;/p&gt;

&lt;p&gt;6:00 PM | Opening Remarks – The vision of Google Cloud and NVIDIA; celebrating one year of the joint developer community.&lt;/p&gt;

&lt;p&gt;6:15 PM | Demo Overview – Preview of the demo showcase.&lt;/p&gt;

&lt;p&gt;6:30-8:00 PM | Demo Showcase &amp;amp; Networking – Deep conversations over drinks, food and demos.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>eventsinyourcity</category>
      <category>ai</category>
    </item>
    <item>
      <title>Hacking perfectly square AI videos with Veo 3.1 and NanoBanana 2</title>
      <dc:creator>Paige Bailey</dc:creator>
      <pubDate>Tue, 12 May 2026 17:08:46 +0000</pubDate>
      <link>https://dev.to/googleai/hacking-perfectly-square-ai-videos-with-veo-31-and-nanobanana-2-5cpn</link>
      <guid>https://dev.to/googleai/hacking-perfectly-square-ai-videos-with-veo-31-and-nanobanana-2-5cpn</guid>
      <description>&lt;p&gt;If you’ve been playing around with AI video generation lately, you already know the struggle: the tech is insanely cool, but sometimes getting it to output &lt;em&gt;exactly&lt;/em&gt; the format you want feels like trying to center a &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; in 2014. &lt;/p&gt;

&lt;p&gt;Recently, I needed to generate a perfectly looping, high-quality &lt;strong&gt;square (1:1) video&lt;/strong&gt; with audio using Google's new video models. The problem? Native aspect ratio support can sometimes be finicky depending on the model tier, and cropping a generated 16:9 or 9:16 video often ruins the framing or hallucinates weird artifacts at the edges.&lt;/p&gt;

&lt;p&gt;So, I had to let it cook. I came up with a slightly hacky but reliable workaround using &lt;strong&gt;NanoBanana 2&lt;/strong&gt;, &lt;strong&gt;Veo 3.1 Lite&lt;/strong&gt;, and our old reliable friend, &lt;strong&gt;FFmpeg&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Here is the ultimate pipeline to get flawless square AI videos:&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Start with a square image concept.&lt;/li&gt;
&lt;li&gt;Ask NanoBanana 2 to convert it to a 9:16 aspect ratio by literally just padding the top and bottom with black bars. &lt;/li&gt;
&lt;li&gt;Feed that phone-format 9:16 image into Veo 3.1 Lite as your start and end frames to force a loop. &lt;/li&gt;
&lt;li&gt;Run a quick Python script using &lt;code&gt;ffmpeg&lt;/code&gt; to slice off the black bars. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Boom. Perfect square video. Perfect audio sync. And no weird edge hallucinations. Here’s how to automate this flow using Python. 🐍&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucp1nk4vu7wel4hinj55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucp1nk4vu7wel4hinj55.png" alt=" " width="768" height="1376"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 1: Generating the "phone format" 9:16 frames with NanoBanana 2
&lt;/h3&gt;

&lt;p&gt;First, we need to generate our 9:16 image with the black bars baked in. Using the new &lt;a href="https://github.com/googleapis/python-genai" rel="noopener noreferrer"&gt;Gemini API SDK&lt;/a&gt;, we can prompt NanoBanana 2 to do the heavy lifting for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize your client
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_padded_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🎨 Generating padded 9:16 image with NanoBanana 2...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We explicitly tell NanoBanana 2 to give us a 9:16 image 
&lt;/span&gt;    &lt;span class="c1"&gt;# where the subject is a square in the middle, padded by black bars.
&lt;/span&gt;    &lt;span class="n"&gt;hacked_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Keep the main subject perfectly square in the center, and pad the top and bottom with solid black bars to make the overall aspect ratio 9:16.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nanobanana-2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Our trusty image model
&lt;/span&gt;        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hacked_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateImagesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;number_of_images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;aspect_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9:16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;output_mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save the output
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;generated_image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generated_images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generated_image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate our start/end frame
&lt;/span&gt;&lt;span class="nf"&gt;generate_padded_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A majestic pink flamingo standing in a serene pond&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flamingo_padded.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Generating the video with Veo 3.1 Lite
&lt;/h3&gt;

&lt;p&gt;Now that we have our 9:16 image with black bars (&lt;code&gt;flamingo_padded.jpg&lt;/code&gt;), we pass it to Veo 3.1 Lite. By using the same image as the visual prompt, we ensure the video maintains those exact black bars throughout the generation process. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: In the Veo web UI, you can set this as the start and end frame for a perfect loop. Here is the API equivalent for generating the video from your image).&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;video_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🎬 Uploading frame and prompting Veo 3.1 Lite...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Upload the padded image to the Gemini API
&lt;/span&gt;    &lt;span class="n"&gt;initial_frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait for the file to be processed
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;initial_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROCESSING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;initial_frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;initial_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Call Veo 3.1 Lite
&lt;/span&gt;    &lt;span class="c1"&gt;# We ask it to animate the subject but keep the black bars untouched
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;veo-3.1-lite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;initial_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;video_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. The flamingo moves slightly, but the black bars at the top and bottom must remain exactly the same.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save the generated video bytes
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# Handling depends on raw bytes returned
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Video generated and saved as &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;generate_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flamingo_padded.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cinematic shot of a flamingo looking around&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_veo_output.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: The &lt;code&gt;ffmpeg&lt;/code&gt; post-processing
&lt;/h3&gt;

&lt;p&gt;Now we have a beautiful video of a flamingo, but it's a 9:16 file with annoying black bars at the top and bottom. &lt;/p&gt;

&lt;p&gt;We &lt;em&gt;could&lt;/em&gt; crop this frame-by-frame using Python libraries like MoviePy, but honestly? &lt;code&gt;ffmpeg&lt;/code&gt; via the &lt;code&gt;subprocess&lt;/code&gt; module is infinitely faster, uses way less memory, and most importantly: &lt;strong&gt;it perfectly preserves the audio stream&lt;/strong&gt; without degrading it through re-encoding.&lt;/p&gt;

&lt;p&gt;Since the video is 9:16, cropping it to &lt;code&gt;iw:iw&lt;/code&gt; (input width x input width) creates a perfect 1:1 square. FFmpeg is smart enough to center the crop automatically, perfectly slicing off the top and bottom black bars.&lt;br&gt;
&lt;/p&gt;
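&lt;p&gt;The centering that &lt;code&gt;crop=iw:iw&lt;/code&gt; does implicitly is just arithmetic: when no offsets are given, FFmpeg's crop filter defaults to &lt;code&gt;x=(in_w-out_w)/2&lt;/code&gt; and &lt;code&gt;y=(in_h-out_h)/2&lt;/code&gt;. A tiny illustrative helper (the function name is mine, not FFmpeg's):&lt;/p&gt;

```python
def centered_crop_offsets(in_w: int, in_h: int, out_w: int, out_h: int) -> tuple[int, int]:
    """Mirror FFmpeg's default crop placement: center the output rectangle in the input."""
    return (in_w - out_w) // 2, (in_h - out_h) // 2

# A 1080x1920 Veo output cropped to a 1080x1080 square:
print(centered_crop_offsets(1080, 1920, 1080, 1080))  # (0, 420): exactly the bar height
```

&lt;p&gt;So the crop starts 420px down from the top, which is exactly where the black bar ends.&lt;/p&gt;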

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crop_to_square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_video&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✂️ Cropping out the black bars with FFmpeg...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ffmpeg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# Overwrite output if it exists
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Input file
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-vf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crop=iw:iw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Video Filter: Crop to width x width (automatically centered!)
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c:a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;copy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Copy the audio as-is (chef's kiss for performance)
&lt;/span&gt;        &lt;span class="n"&gt;output_video&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEVNULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEVNULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔥 Success! Perfectly square video saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_video&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CalledProcessError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💀 FFmpeg failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run the final crop
&lt;/span&gt;&lt;span class="nf"&gt;crop_to_square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_veo_output.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_square_flamingo.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2054247282841985125-771" src="https://platform.twitter.com/embed/Tweet.html?id=2054247282841985125"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this workaround actually... works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Framing control:&lt;/strong&gt; When you force the AI to outpaint black bars first, &lt;em&gt;you&lt;/em&gt; control the framing of the main subject. You aren't relying on the video model to guess what to keep in the center.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio preservation:&lt;/strong&gt; The &lt;code&gt;'-c:a', 'copy'&lt;/code&gt; flag in FFmpeg ensures you don't lose any audio fidelity when manipulating the video file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero hallucinations:&lt;/strong&gt; Because the video model is explicitly told to keep the black bars, it doesn't waste compute trying to generate weird background details at the extreme top and bottom edges. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sometimes the best engineering solutions are just stacking simple tools together in a trench coat. 🧥&lt;/p&gt;

&lt;p&gt;Have you found any other weird/genius hacks for wrangling AI video generation APIs? Drop them in the comments, I’d love to test them out! &lt;/p&gt;

&lt;p&gt;&lt;em&gt;(P.S. Make sure you have &lt;code&gt;ffmpeg&lt;/code&gt; already installed on your machine before running the Python script, or it will yell at you).&lt;/em&gt;&lt;/p&gt;
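&lt;p&gt;If you'd rather have the script fail fast than get yelled at mid-pipeline, a small preflight check works well. This is a sketch; the install hints in the message are just examples for common platforms:&lt;/p&gt;

```python
import shutil

def ffmpeg_available() -> bool:
    """Check whether the ffmpeg binary is on PATH before running the pipeline."""
    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found. Install it first, e.g. 'brew install ffmpeg' or 'apt install ffmpeg'.")
        return False
    return True

print(ffmpeg_available())
```

&lt;p&gt;Call it once at the top of your script and bail out early if it returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;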

</description>
      <category>python</category>
      <category>ai</category>
      <category>videogen</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Deploying a Multi-Agent System with Terraform and Cloud Run</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 07 May 2026 21:04:12 +0000</pubDate>
      <link>https://dev.to/googleai/deploying-a-multi-agent-system-with-terraform-and-cloud-run-2a9c</link>
      <guid>https://dev.to/googleai/deploying-a-multi-agent-system-with-terraform-and-cloud-run-2a9c</guid>
      <description>&lt;p&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal: a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/p&gt;

&lt;p&gt;In the first three parts of this series, we laid the essential groundwork by establishing its core capabilities and local verification process:&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/googleai/building-capabilities-for-a-multi-agent-system-with-google-adk-mcp-and-cloud-run-ab9"&gt;part 1&lt;/a&gt;, we standardized the agent's capabilities through the Model Context Protocol (MCP), connecting it to Reddit for trend discovery and Google Cloud Docs for technical grounding. In &lt;a href="https://dev.to/googleai/architect-a-personalized-multi-agent-system-with-long-term-memory-3o15"&gt;part 2&lt;/a&gt;, we built a multi-agent architecture and integrated the Vertex AI memory bank to allow the system to learn and persist user preferences across different conversations. In &lt;a href="https://dev.to/googleai/local-testing-of-a-multi-agent-system-with-memory-37mm"&gt;part 3&lt;/a&gt;, we verified the full end-to-end lifecycle locally using a dedicated test runner to ensure that research, content creation, and cloud-based memory retrieval were perfectly synchronized.&lt;/p&gt;

&lt;p&gt;If you'd like to dive straight into the code, you can clone the repository &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment to Cloud Run and the Path to Production
&lt;/h2&gt;

&lt;p&gt;To help you transition from this local prototype to a production service, this final part focuses on building the production backbone of your agent using the foundational deployment patterns provided by the &lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener noreferrer"&gt;Agent Starter Pack&lt;/a&gt;. We will implement the essential structural components required for monitoring, data integrity, and long-term state management in the cloud. You will learn to implement the application server and helper utilities needed for a production-ready deployment before provisioning secure, reproducible infrastructure with Terraform.&lt;/p&gt;

&lt;p&gt;While the Dockerfile packages your agent's code and its specialized dependencies, such as Node.js for the Reddit MCP tool, Terraform is used to build the platform it lives on. Terraform automates the creation of your Artifact Registry, least-privilege service accounts, and Secret Manager integrations to ensure your API keys remain protected.&lt;/p&gt;

&lt;p&gt;By the end of this part, you will have a standardized application framework deployed on Google Cloud Run and a roadmap for graduating your prototype through continuous evaluation, CI/CD and advanced observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Utilities and Server: Building the System's Body
&lt;/h2&gt;

&lt;p&gt;In this section, you implement the structural components required for monitoring and long-term state management in the cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Application Server:&lt;/strong&gt; Initializing the FastAPI server and establishing a vital connection to the Vertex AI memory bank.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementing Telemetry:&lt;/strong&gt; Enabling 'Agent Traces' for visibility into internal reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Application Server
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;fast_api_app.py&lt;/code&gt; file serves as the vital entry point for your agent, transforming the core logic into a production FastAPI server that acts as the "body" of your system. When deploying to Cloud Run, this server is essential because it provides the necessary web interface to listen for incoming HTTP requests and dispatch them to the agent for processing. Beyond basic serving, its most critical role is establishing a connection to the Vertex AI memory bank by defining a &lt;code&gt;MEMORY_URI&lt;/code&gt;, which allows the ADK framework to persist and retrieve user preferences across different production sessions. Additionally, the application server initializes production-grade telemetry for real-time monitoring.&lt;/p&gt;

&lt;p&gt;Go back to the &lt;code&gt;dev_signal_agent&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste the following code in &lt;code&gt;dev_signal_agent/fast_api_app.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.cli.fast_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_fast_api_app&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cloud_logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;agent_engines&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dev_signal_agent.app_utils.env&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_environment&lt;/span&gt;

&lt;span class="c1"&gt;# --- Initialization &amp;amp; Secure Secret Retrieval ---
# We now unpack the SECRETS dictionary returned by our updated env.py
&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MODEL_LOC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SERVICE_LOC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SECRETS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;init_environment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cloud_logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Access sensitive credentials from the SECRETS dictionary
# These keys stay in memory and are NOT injected into os.environ
&lt;/span&gt;&lt;span class="n"&gt;REDDIT_CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;REDDIT_CLIENT_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;REDDIT_USER_AGENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_USER_AGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DK_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DK_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Configuration &amp;amp; Sessions ---
&lt;/span&gt;&lt;span class="n"&gt;AGENT_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="c1"&gt;# Non-sensitive configuration uses environment variables
&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI_ASSETS_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;USE_IN_MEMORY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USE_IN_MEMORY_SESSION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- MEMORY BANK CONNECTION ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_memory_bank_uri&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;USE_IN_MEMORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="c1"&gt;# We use 'dev_signal_agent' as the display name for the Vertex AI memory bank
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_ENGINE_MEMORY_BANK_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev_signal_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;agent_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agentengine://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG: Connecting to Memory Bank: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (display_name=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;

&lt;span class="n"&gt;SESSION_URI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MEMORY_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_memory_bank_uri&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# --- Initialize FastAPI with ADK ---
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_fast_api_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AGENT_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;web&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;artifact_service_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;BUCKET&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALLOW_ORIGINS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALLOW_ORIGINS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_service_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SESSION_URI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_service_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MEMORY_URI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;--- Connects the Memory Bank
&lt;/span&gt;    &lt;span class="n"&gt;otel_to_cloud&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;--- Enables production telemetry
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;
    &lt;span class="c1"&gt;# Standard Cloud Run port is 8080
&lt;/span&gt;    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implementing Telemetry
&lt;/h3&gt;

&lt;p&gt;In a production environment, visibility into your agent's reasoning is critical. We leverage the built-in observability features of the Google ADK by setting the &lt;code&gt;otel_to_cloud=True&lt;/code&gt; flag in our application server. This single parameter handles the majority of the instrumentation automatically, exporting "Agent Traces" directly to the Google Cloud Console. These traces provide a "visual waterfall" of the agent's operation, including individual agent thought processes, LLM invocations, and MCP tool calls.&lt;/p&gt;

&lt;h4&gt;
  
  
  Monitoring vs. Targeted Evaluation
&lt;/h4&gt;

&lt;p&gt;It is essential to understand that production tracing is subject to sampling to balance performance and cost. Because Cloud Run captures only a subset of requests, not every individual user interaction will be visible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System Traces (Monitoring):&lt;/strong&gt; Used to analyze behavior "at large," such as identifying latency bottlenecks or system timeouts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Traces (Evaluation):&lt;/strong&gt; High-quality evaluation mandates targeted trace capture. This means calling the agent specifically for a test case where you know you will evaluate that particular request in full detail.&lt;/li&gt;
&lt;/ul&gt;
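&lt;p&gt;To capture a targeted trace, call the agent deliberately with a single known test case so that specific request can be inspected in full. The sketch below assumes the ADK server's &lt;code&gt;/run&lt;/code&gt; endpoint and request shape; verify both against your ADK version before relying on it:&lt;/p&gt;

```python
import json
import urllib.request


def build_run_payload(prompt: str, session_id: str, user_id: str = "eval-user") -> dict:
    """Request body for the ADK /run endpoint (shape may vary by ADK version)."""
    return {
        "app_name": "dev_signal_agent",
        "user_id": user_id,
        "session_id": session_id,
        "new_message": {"role": "user", "parts": [{"text": prompt}]},
    }


def run_eval_case(base_url: str, prompt: str, session_id: str):
    """Send one known test case so this exact request is traced end to end."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/run",
        data=json.dumps(build_run_payload(prompt, session_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

&lt;p&gt;Using a fixed &lt;code&gt;session_id&lt;/code&gt; per test case makes the resulting trace easy to locate in the Trace Explorer.&lt;/p&gt;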

&lt;h4&gt;
  
  
  Viewing the Trace
&lt;/h4&gt;

&lt;p&gt;To see your traces, navigate to the Trace Explorer in the Google Cloud Console and filter for your service (e.g., &lt;code&gt;dev-signal&lt;/code&gt;). Clicking a specific Trace ID opens a Gantt chart that allows you to distinguish between cognitive reasoning failures (wrong decisions) and physical system issues (timeouts).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbe3brcww32j0igfr7zh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbe3brcww32j0igfr7zh.png" alt="Trace Explorer view" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For advanced configurations, refer to the following documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.cloud.google.com/run/docs/trace#trace_sampling_rate?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run Trace Sampling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cloud.google.com/stackdriver/docs/instrumentation/ai-agent-adk#configure?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Configuring ADK Telemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cloud.google.com/trace/docs/collect-view-multimodal-prompts-responses?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Multimodal Trace Capture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://google.github.io/adk-docs/integrations/bigquery-agent-analytics/" rel="noopener noreferrer"&gt;BigQuery Agent Analytics Integration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code: Provisioning Secure Cloud Resources
&lt;/h2&gt;

&lt;p&gt;We follow the infrastructure-as-code patterns from the &lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener noreferrer"&gt;Agent Starter Pack&lt;/a&gt;'s security-first design, which automate the creation of least-privilege service accounts and robust secret management in seconds.&lt;/p&gt;

&lt;p&gt;Using Terraform ensures that your entire Google Cloud environment - from IAM roles to Secret Manager versions - is defined in reproducible, secure code. We break our infrastructure into the following logical blocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resources &amp;amp; Variables&lt;/strong&gt;: Define the specific project, region, and sensitive API secrets used by the agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Infrastructure&lt;/strong&gt;: Enable essential APIs and provision a private Artifact Registry to host your agent's container images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity &amp;amp; Access Management (IAM)&lt;/strong&gt;: Configure specialized Service Accounts that strictly follow the Principle of Least Privilege to ensure your system remains secure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret Management&lt;/strong&gt;: Securely ingest API credentials into Google Secret Manager for protected runtime access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run Configuration&lt;/strong&gt;: Define the container environment, resource limits, and automated secret injection for the final deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To begin provisioning, return to the root folder of your project (&lt;code&gt;dev-signal&lt;/code&gt;) and create the necessary deployment directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;span class="nb"&gt;mkdir &lt;/span&gt;deployment
&lt;span class="nb"&gt;cd &lt;/span&gt;deployment
&lt;span class="nb"&gt;mkdir &lt;/span&gt;terraform
&lt;span class="nb"&gt;cd &lt;/span&gt;terraform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Terraform Resources and Variables
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;variables.tf&lt;/code&gt; file defines the configurable parameters for your deployment, allowing you to customize the infrastructure without altering the underlying logic. It includes variables for the &lt;code&gt;project_id&lt;/code&gt;, the deployment &lt;code&gt;region&lt;/code&gt; (defaulting to &lt;code&gt;us-central1&lt;/code&gt;), and the &lt;code&gt;service_name&lt;/code&gt; for your Cloud Run instance. Furthermore, it defines a &lt;code&gt;secrets&lt;/code&gt; map used to securely ingest sensitive API credentials—such as Reddit and Developer Knowledge keys—into Google Secret Manager for runtime access. This modular approach ensures your production environment remains reproducible, secure, and adaptable across different projects.&lt;/p&gt;

&lt;p&gt;Paste the following code into &lt;code&gt;deployment/terraform/variables.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"project_id"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"The Google Cloud Project ID"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"region"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"The Google Cloud region to deploy to"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-central1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"service_name"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"The name of the Cloud Run service"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dev-signal"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"secrets"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A map of secret names and their values (e.g., REDDIT_CLIENT_ID, DK_API_KEY)"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"ai_assets_bucket"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"The GCS bucket for storing AI assets"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
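&lt;p&gt;At plan time, you can supply these variables through a &lt;code&gt;terraform.tfvars&lt;/code&gt; file or &lt;code&gt;TF_VAR_&lt;/code&gt; environment variables. A minimal, illustrative example follows; every value is a placeholder, and real credentials should never be committed to version control:&lt;/p&gt;

```hcl
# terraform.tfvars -- illustrative placeholders only; keep out of version control
project_id       = "my-gcp-project"
ai_assets_bucket = "my-gcp-project-ai-assets"

secrets = {
  REDDIT_CLIENT_ID     = "placeholder"
  REDDIT_CLIENT_SECRET = "placeholder"
  REDDIT_USER_AGENT    = "dev-signal-agent/1.0"
  DK_API_KEY           = "placeholder"
}
```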



&lt;h3&gt;
  
  
  Core Infrastructure Logic
&lt;/h3&gt;

&lt;p&gt;We define our infrastructure in logical blocks. Here is what each part does:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Enable APIs&lt;/strong&gt;: Ensures the project has the necessary services active (Cloud Run, Vertex AI, etc.). We set &lt;code&gt;disable_on_destroy = false&lt;/code&gt; so these APIs stay enabled even if the Terraform-managed resources are destroyed, avoiding disruption to other workloads in the project.&lt;/p&gt;

&lt;p&gt;Paste the following code into &lt;code&gt;deployment/terraform/main.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_project_service"&lt;/span&gt; &lt;span class="s2"&gt;"services"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="s2"&gt;"run.googleapis.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"artifactregistry.googleapis.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"cloudbuild.googleapis.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"aiplatform.googleapis.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"secretmanager.googleapis.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"logging.googleapis.com"&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="nx"&gt;service&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;
  &lt;span class="nx"&gt;disable_on_destroy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Artifact Registry&lt;/strong&gt;: Creates a private Docker registry to store our agent's container images.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_artifact_registry_repository"&lt;/span&gt; &lt;span class="s2"&gt;"repo"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
  &lt;span class="nx"&gt;repository_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dev-signal-repo"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Docker repository for Dev Signal Agent"&lt;/span&gt;
  &lt;span class="nx"&gt;format&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DOCKER"&lt;/span&gt;
  &lt;span class="nx"&gt;depends_on&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;google_project_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;services&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Service Account &amp;amp; IAM: Adhering to the Principle of Least Privilege&lt;/strong&gt; - This is a critical security step. We avoid using the default compute service account and instead provision a dedicated user-managed service account (&lt;code&gt;dev-signal-sa&lt;/code&gt;). By designating this as the Cloud Run service identity, we can grant it only the minimum necessary permissions—specifically &lt;code&gt;roles/aiplatform.user&lt;/code&gt;, &lt;code&gt;roles/logging.logWriter&lt;/code&gt;, and &lt;code&gt;roles/storage.objectAdmin&lt;/code&gt;. This granular access control gives the agent exactly the permissions it needs to interact with Vertex AI and Cloud Storage without over-granting access to other sensitive cloud resources, significantly reducing the potential impact of a compromised account. Learn more about &lt;a href="https://docs.cloud.google.com/iam/docs/best-practices-service-accounts?content_ref=because%20a%20service%20account%20is%20a%20principal%20you%20must%20limit%20its%20privileges%20to%20reduce%20the%20potential%20harm%20that%20can%20be%20done%20by%20a%20compromised%20service%20account&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;best practices for using service accounts securely&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_service_account"&lt;/span&gt; &lt;span class="s2"&gt;"agent_sa"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
  &lt;span class="nx"&gt;account_id&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.service_name}-sa"&lt;/span&gt;
  &lt;span class="nx"&gt;display_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Dev Signal Agent Service Account"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
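&lt;p&gt;The role grants described above are separate resources from the service account itself. If your &lt;code&gt;main.tf&lt;/code&gt; does not already define them, a minimal sketch of the three bindings looks like this:&lt;/p&gt;

```hcl
# Bind only the three roles the agent actually needs (least privilege)
resource "google_project_iam_member" "agent_sa_roles" {
  for_each = toset([
    "roles/aiplatform.user",
    "roles/logging.logWriter",
    "roles/storage.objectAdmin",
  ])
  project = var.project_id
  role    = each.key
  member  = "serviceAccount:${google_service_account.agent_sa.email}"
}
```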



&lt;p&gt;&lt;strong&gt;4. Secret Management&lt;/strong&gt;: This handles your API keys securely. It creates secrets in Google Secret Manager and gives the agent's Service Account permission to access them at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_secret_manager_secret"&lt;/span&gt; &lt;span class="s2"&gt;"agent_secrets"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="nx"&gt;secret_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;
  &lt;span class="nx"&gt;replication&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;auto&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;google_project_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;services&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_secret_manager_secret_version"&lt;/span&gt; &lt;span class="s2"&gt;"agent_secrets_version"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="nx"&gt;secret&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_secret_manager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;agent_secrets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;secret_data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_secret_manager_secret_iam_member"&lt;/span&gt; &lt;span class="s2"&gt;"secret_accessor"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="nx"&gt;secret_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_secret_manager_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;agent_secrets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"roles/secretmanager.secretAccessor"&lt;/span&gt;
  &lt;span class="nx"&gt;member&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"serviceAccount:${google_service_account.agent_sa.email}"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Cloud Run Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Best Practice:&lt;/strong&gt; To satisfy production security standards, our &lt;code&gt;main.tf&lt;/code&gt; grants the Service Account the &lt;code&gt;secretmanager.secretAccessor&lt;/code&gt; role. Our Python application then uses the &lt;a href="https://docs.cloud.google.com/secret-manager/docs/best-practices#coding-practices" rel="noopener noreferrer"&gt;Secret Manager SDK&lt;/a&gt; to pull these credentials directly into local memory at runtime, ensuring they never touch the container's environment configuration.&lt;br&gt;
&lt;/p&gt;
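As a sketch of that runtime pattern (the function and variable names here are illustrative, not the project's actual module), the application can build a secret's full resource name and read its payload into process memory with the Secret Manager client:

```python
def secret_version_path(project_id, secret_id, version="latest"):
    """Build the full resource name Secret Manager expects."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def fetch_secret(project_id, secret_id):
    """Read a secret payload into local memory at runtime.

    The value lives only in this process; nothing is written to the
    container's environment configuration.
    """
    # Imported lazily so the path helper stays usable without the package.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    name = secret_version_path(project_id, secret_id)
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")
```

Because the Service Account already holds `roles/secretmanager.secretAccessor`, no key files are involved; the client picks up the Cloud Run service's ambient credentials.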

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 6. Cloud Run Service Deployment&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_cloud_run_v2_service"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_name&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;
  &lt;span class="nx"&gt;ingress&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"INGRESS_TRAFFIC_ALL"&lt;/span&gt;

  &lt;span class="nx"&gt;template&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;service_account&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;google_service_account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;agent_sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;

    &lt;span class="nx"&gt;containers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-docker.pkg.dev/cloudrun/container/hello"&lt;/span&gt; &lt;span class="c1"&gt;# Placeholder until first build&lt;/span&gt;

      &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"global"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GOOGLE_GENAI_USE_VERTEXAI"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"True"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AI_ASSETS_BUCKET"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_assets_bucket&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;limits&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;cpu&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;
          &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2Gi"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;traffic&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"&lt;/span&gt;
    &lt;span class="nx"&gt;percent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Provision the Infrastructure
&lt;/h3&gt;

&lt;p&gt;Before we can deploy our code, we need to provision the Google Cloud infrastructure we just defined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialize Terraform&lt;/strong&gt;: This downloads the necessary provider plugins. Run this in the &lt;code&gt;deployment/terraform&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create a Variables File&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;deployment/terraform/terraform.tfvars&lt;/code&gt; and update it with your project details and secrets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;       &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-project-id"&lt;/span&gt;
&lt;span class="nx"&gt;region&lt;/span&gt;           &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-central1"&lt;/span&gt;
&lt;span class="nx"&gt;service_name&lt;/span&gt;     &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dev-signal"&lt;/span&gt;
&lt;span class="nx"&gt;ai_assets_bucket&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-bucket-name"&lt;/span&gt;
&lt;span class="nx"&gt;secrets&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;REDDIT_CLIENT_ID&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your_client_id"&lt;/span&gt;
  &lt;span class="nx"&gt;REDDIT_CLIENT_SECRET&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your_client_secret"&lt;/span&gt;
  &lt;span class="nx"&gt;REDDIT_USER_AGENT&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your_user_agent"&lt;/span&gt;
  &lt;span class="nx"&gt;DK_API_KEY&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your_dk_api_key"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Plan Configuration&lt;/strong&gt;: This allows you to review the changes before they are applied. Run this in the &lt;code&gt;deployment/terraform&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;plan.tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Apply Configuration&lt;/strong&gt;: Once you have reviewed the plan and confirmed it does what you want, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform apply plan.tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deployment: Containerization and the Cloud Build Pipeline
&lt;/h2&gt;

&lt;p&gt;In this final stage of the build process, we package our agent's "body" and "brain" into a portable, production-ready container. This ensures that every component - from our Python logic to the Node.js environment required for the Reddit MCP tool - is bundled together with its exact dependencies.&lt;/p&gt;

&lt;p&gt;We utilize a &lt;strong&gt;Dockerfile&lt;/strong&gt; to define this environment and a &lt;strong&gt;Makefile&lt;/strong&gt; to orchestrate the deployment pipeline. When you trigger the deployment, &lt;a href="https://console.cloud.google.com/cloud-build/builds" rel="noopener noreferrer"&gt;Google Cloud Build&lt;/a&gt; takes your local source code, builds the container image according to the Dockerfile, and stores it in the private Artifact Registry created earlier by Terraform. Finally, the pipeline automatically updates your Cloud Run service to serve traffic using this fresh image, completing the journey from local code to a live, secure cloud workload.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev-signal/Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.12-slim&lt;/span&gt;

&lt;span class="c"&gt;# Install Node.js and npm for MCP tools (like reddit-mcp)&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    curl &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://deb.nodesource.com/setup_20.x | bash - &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nodejs &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; reddit-mcp &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get clean &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nv"&gt;uv&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.8.13

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /code&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./pyproject.toml ./README.md ./uv.lock* ./&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./dev_signal_agent ./dev_signal_agent&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--frozen&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["uv", "run", "uvicorn", "dev_signal_agent.fast_api_app:app", "--host", "0.0.0.0", "--port", "8080"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Makefile&lt;/strong&gt; automates the build and deployment.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev-signal/Makefile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;?=&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;shell gcloud config get-value project&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;REGION&lt;/span&gt;     &lt;span class="o"&gt;?=&lt;/span&gt; us-central1
&lt;span class="nv"&gt;IMAGE_REPO&lt;/span&gt; &lt;span class="o"&gt;?=&lt;/span&gt; dev-signal-repo
&lt;span class="nv"&gt;IMAGE&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;REGION&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="p"&gt;$(&lt;/span&gt;PROJECT_ID&lt;span class="p"&gt;)&lt;/span&gt;/&lt;span class="p"&gt;$(&lt;/span&gt;IMAGE_REPO&lt;span class="p"&gt;)&lt;/span&gt;/agent:latest

&lt;span class="c"&gt;# Deploy via Cloud Build &amp;amp; Container
&lt;/span&gt;&lt;span class="nl"&gt;docker-deploy&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"? Building and deploying to &lt;/span&gt;&lt;span class="p"&gt;$(&lt;/span&gt;&lt;span class="s2"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; via Cloud Build..."&lt;/span&gt;
    gcloud builds submit &lt;span class="nt"&gt;--tag&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;IMAGE&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;PROJECT_ID&lt;span class="p"&gt;)&lt;/span&gt; .
    gcloud run services update dev-signal &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--image&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;IMAGE&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;REGION&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;PROJECT_ID&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;--labels&lt;/span&gt; dev-tutorial&lt;span class="o"&gt;=&lt;/span&gt;dev-signal-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy Application
&lt;/h3&gt;

&lt;p&gt;Now that our infrastructure is ready, we can build and deploy the application code.&lt;/p&gt;

&lt;p&gt;Run the following command from the root of your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make docker-deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens when you run this?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt;: Google Cloud Build takes your local code and the &lt;code&gt;Dockerfile&lt;/code&gt;, builds a container image, and stores it in the Artifact Registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy&lt;/strong&gt;: It updates the Cloud Run service defined in Terraform to use this new image.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the deployment completes, you should get a message like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Service [dev-signal] revision [dev-signal...] has been deployed and is serving 100 percent of traffic.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Service URL: https://dev-signal-...-.us-central1.run.app&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification: Accessing and Testing Your Deployed Agent
&lt;/h2&gt;

&lt;p&gt;Since production services are private by default, this section covers how to grant permissions and access the agent securely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managing IAM Permissions:&lt;/strong&gt; Granting the necessary &lt;code&gt;run.invoker&lt;/code&gt; role to authorized users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secure Access via Cloud Run Proxy:&lt;/strong&gt; Using the &lt;code&gt;gcloud&lt;/code&gt; proxy to interact with your live service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Granting User Permissions
&lt;/h3&gt;

&lt;p&gt;Before you can invoke the service, you must grant your Google account the &lt;code&gt;roles/run.invoker&lt;/code&gt; role for this specific service. Run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services add-iam-policy-binding dev-signal &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"user:&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value account&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/run.invoker"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Launch the Proxy
&lt;/h3&gt;

&lt;p&gt;Now, access your private service securely via the proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run services proxy dev-signal &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;gcloud config get-value project&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visit &lt;strong&gt;http://localhost:8080&lt;/strong&gt; to chat with your deployed agent! See a possible test scenario in &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory" rel="noopener noreferrer"&gt;part 3&lt;/a&gt; of the series.&lt;/p&gt;
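Before opening the browser, you can sanity-check the tunnel from a second terminal. A minimal sketch (the exact routes depend on how `fast_api_app` wires the agent, so this only confirms that the proxied service answers at all):

```python
import urllib.error
import urllib.request


def check_proxy(url="http://localhost:8080/", timeout=5):
    """Return the HTTP status the proxied service answers with."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Any HTTP response, even a 404, proves the proxy is forwarding.
        return exc.code
    except urllib.error.URLError:
        # Connection refused: the proxy is not running yet.
        return None
```

A `None` result means the proxy is not listening yet; any HTTP status means requests are reaching your Cloud Run service.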

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Congratulations! You have successfully built &lt;strong&gt;Dev Signal&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we covered:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/googleai/building-capabilities-for-a-multi-agent-system-with-google-adk-mcp-and-cloud-run-ab9"&gt;&lt;strong&gt;Tooling (MCP)&lt;/strong&gt;&lt;/a&gt;: You connected your agent to &lt;strong&gt;Reddit&lt;/strong&gt;, &lt;strong&gt;Google Docs&lt;/strong&gt;, and a &lt;strong&gt;Local Image Generator&lt;/strong&gt; using the Model Context Protocol.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/googleai/architect-a-personalized-multi-agent-system-with-long-term-memory-3o15"&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/a&gt;: You implemented a &lt;strong&gt;Root Orchestrator&lt;/strong&gt; managing specialized agents (Scanner, Expert, Drafter).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/googleai/local-testing-of-a-multi-agent-system-with-memory-37mm"&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/a&gt;: You integrated &lt;strong&gt;Vertex AI memory bank&lt;/strong&gt; to give your agent long-term persistence across sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production&lt;/strong&gt;: You deployed the entire stack to &lt;strong&gt;Google Cloud Run&lt;/strong&gt; using &lt;strong&gt;Terraform&lt;/strong&gt; for secure, reproducible infrastructure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You now have a solid foundation for building sophisticated, stateful AI applications on Google Cloud.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>terraform</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Local Testing of a Multi-Agent System with Memory</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 07 May 2026 21:03:03 +0000</pubDate>
      <link>https://dev.to/googleai/local-testing-of-a-multi-agent-system-with-memory-37mm</link>
      <guid>https://dev.to/googleai/local-testing-of-a-multi-agent-system-with-memory-37mm</guid>
      <description>&lt;p&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal: a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/googleai/building-capabilities-for-a-multi-agent-system-with-google-adk-mcp-and-cloud-run-ab9"&gt;part 1&lt;/a&gt; and &lt;a href="https://dev.to/googleai/architect-a-personalized-multi-agent-system-with-long-term-memory-3o15"&gt;part 2&lt;/a&gt; of this series, we established the essential groundwork by standardizing the core capabilities through the Model Context Protocol (MCP) and constructing a multi-agent architecture integrated with the Vertex AI memory bank to provide long-term intelligence and persistence. Now, we'll explore how to test your multi-agent system locally!&lt;/p&gt;

&lt;p&gt;If you'd like to dive straight into the code and explore it at your own pace, you can clone the repository &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the Agent Locally
&lt;/h2&gt;

&lt;p&gt;Before transitioning your agentic system to Google Cloud Run, it is essential to ensure that its specialized components work seamlessly together on your workstation. This testing phase allows you to validate trend discovery, technical grounding, and creative drafting within a local feedback loop, saving time and resources during the development process.&lt;/p&gt;

&lt;p&gt;In this section, you will configure your local secrets, implement environment-aware utilities, and use a dedicated test runner to verify that Dev Signal can correctly retrieve user preferences from the Vertex AI memory bank in the cloud. This local verification ensures that your agent's "brain" and "hands" are properly synchronized before moving to deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in your project root. These variables are used for local development and will be replaced by Terraform/Secret Manager in production.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev-signal/.env&lt;/code&gt; and update with your own details.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: &lt;code&gt;GOOGLE_CLOUD_LOCATION&lt;/code&gt; is set as &lt;code&gt;global&lt;/code&gt; because that is where &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; is supported. We will use &lt;code&gt;GOOGLE_CLOUD_LOCATION&lt;/code&gt; for the model location.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Google Cloud Configuration
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=global
GOOGLE_CLOUD_REGION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=True
AI_ASSETS_BUCKET=your_bucket_name

# Reddit API Credentials
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USER_AGENT=my-agent/0.1

# Developer Knowledge API Key
DK_API_KEY=your_api_key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
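The file is plain KEY=VALUE pairs. As a rough sketch of what a loader like python-dotenv does with it (the real library also handles quoting, multi-line values, and interpolation), including the precedence the agent relies on, where a value already present in the process environment wins over the file:

```python
import os


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blanks and comments (simplified)."""
    pairs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


sample = """
# Google Cloud Configuration
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=global
"""

for key, value in parse_env_file(sample).items():
    # setdefault mirrors load_dotenv's default behavior: an existing
    # environment variable is never overwritten by the file.
    os.environ.setdefault(key, value)
```

This is why the same code runs unchanged in production: there, Terraform and Secret Manager populate the environment first, so the `.env` values are simply never consulted.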



&lt;h2&gt;
  
  
  Helper Utilities
&lt;/h2&gt;

&lt;p&gt;Create a new directory for your application utils:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;dev_signal_agent
&lt;span class="nb"&gt;mkdir &lt;/span&gt;app_utils
&lt;span class="nb"&gt;cd &lt;/span&gt;app_utils
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment Configuration
&lt;/h3&gt;

&lt;p&gt;This module standardizes how the agent discovers the active Google Cloud Project and Region, ensuring a seamless transition between development environments. Using &lt;code&gt;load_dotenv()&lt;/code&gt;, the script first checks for local configurations before falling back to &lt;code&gt;google.auth.default()&lt;/code&gt; or environment variables to retrieve the Project ID. This automated approach ensures your agent is properly authenticated and grounded in the correct cloud context without requiring manual configuration changes.&lt;/p&gt;

&lt;p&gt;Beyond basic project discovery, the script provides a robust &lt;strong&gt;Secret Management&lt;/strong&gt; layer. It attempts to resolve sensitive credentials, such as Reddit API keys, first from the local environment (for rapid development) and then dynamically from the &lt;a href="https://docs.cloud.google.com/secret-manager/docs/reference/rest" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud Secret Manager API&lt;/strong&gt;&lt;/a&gt; for production security. By returning these as a dictionary rather than injecting them into environment variables, the module maintains a clean security posture.&lt;/p&gt;

&lt;p&gt;The script further calibrates the environment by distinguishing between global and regional requirements for different AI services. It specifically assigns the "global" location for models to access cutting-edge preview features while designating a regional location, such as &lt;code&gt;us-central1&lt;/code&gt;, for infrastructure like the Vertex AI Agent Engine.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/app_utils/env.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;secretmanager&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_fetch_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch secrets from Secret Manager and return them as a dictionary.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;secrets_to_fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_USER_AGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DK_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;fetched_secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="c1"&gt;# First, check local environment (for local development via .env)
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;secrets_to_fetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;fetched_secrets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;

    &lt;span class="c1"&gt;# If keys are missing (common in production), fetch from Secret Manager API
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fetched_secrets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets_to_fetch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secretmanager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SecretManagerServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;secret_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;secrets_to_fetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;secret_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fetched_secrets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/secrets/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/versions/latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;access_secret_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                    &lt;span class="n"&gt;fetched_secrets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTF-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Warning: Could not fetch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; from Secret Manager: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fetched_secrets&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_environment&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Consolidated environment discovery.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;model_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;service_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;service_location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_fetch_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
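&lt;p&gt;The environment-first, Secret-Manager-fallback pattern inside &lt;code&gt;_fetch_secrets&lt;/code&gt; can be sketched in isolation. The &lt;code&gt;resolve_secrets&lt;/code&gt; helper and the simulated store below are illustrative assumptions for this post, not part of the tutorial code: the dictionary lookup simply stands in for the Secret Manager client.&lt;/p&gt;

```python
import os

def resolve_secrets(names, fallback_lookup):
    """Resolve each name from the local environment first (the .env path),
    then fall back to `fallback_lookup` for anything still missing
    (standing in for the Secret Manager client)."""
    resolved = {n: os.environ[n] for n in names if os.environ.get(n)}
    for n in names:
        if n not in resolved:
            try:
                resolved[n] = fallback_lookup(n)
            except KeyError:
                print(f"Warning: could not resolve {n}")
    return resolved

# Simulated Secret Manager store and a local override (illustration only).
store = {
    "REDDIT_CLIENT_ID": "id-from-secret-manager",
    "DK_API_KEY": "key-from-secret-manager",
}
os.environ["REDDIT_CLIENT_ID"] = "id-from-env"  # local .env value wins

secrets = resolve_secrets(["REDDIT_CLIENT_ID", "DK_API_KEY"], store.__getitem__)
print(secrets["REDDIT_CLIENT_ID"])  # the local environment value
print(secrets["DK_API_KEY"])        # fetched via the fallback
```

&lt;p&gt;Checking the environment first keeps local development cheap (no API calls, no cloud credentials needed) while production deployments, which ship without a &lt;code&gt;.env&lt;/code&gt; file, transparently fall through to Secret Manager.&lt;/p&gt;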



&lt;h2&gt;
  
  
  Local Testing Script
&lt;/h2&gt;

&lt;p&gt;The Google ADK comes with a built-in Web UI that is excellent for visualizing agent logic and tool composition. &lt;/p&gt;

&lt;p&gt;You can launch it by running the following command in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run adk web
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the default Web UI cannot exercise the long-term memory integration described in this tutorial: it is not connected to a Vertex AI memory session and, out of the box, relies on in-memory services that do not persist data across sessions. We therefore use a dedicated &lt;code&gt;test_local.py&lt;/code&gt; script to explicitly initialize the &lt;code&gt;VertexAiMemoryBankService&lt;/code&gt;, so that even in a local environment your agent talks to the real cloud-based memory bank and you can validate preference persistence.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;test_local.py&lt;/code&gt; script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connects to the real &lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/overview" rel="noopener noreferrer"&gt;&lt;strong&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/a&gt; in the cloud for memory storage.&lt;/li&gt;
&lt;li&gt;Uses an in-memory session service for local chat history (so you can wipe it easily).&lt;/li&gt;
&lt;li&gt;Runs a chat loop where you can talk to your agent.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Go back to the root folder &lt;code&gt;dev-signal&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste this code in &lt;code&gt;dev-signal/test_local.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.runners&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.memory.vertex_ai_memory_bank_service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VertexAiMemoryBankService&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySessionService&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;agent_engines&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dev_signal_agent.agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;root_agent&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Setup Configuration
&lt;/span&gt;    &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Agent Engine (Memory) MUST use a regional endpoint
&lt;/span&gt;    &lt;span class="n"&gt;resource_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;agent_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Initializing Vertex AI in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource_location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Find the Agent Engine Resource for Memory
&lt;/span&gt;    &lt;span class="n"&gt;existing_agents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing_agents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;agent_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;existing_agents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;agent_engine_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Using persistent Memory Bank from Agent: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_engine_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Error: Agent Engine &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found. Please deploy with Terraform first.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Initialize Services
&lt;/span&gt;    &lt;span class="n"&gt;session_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemorySessionService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;memory_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VertexAiMemoryBankService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent_engine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_engine_id&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Create a Runner
&lt;/span&gt;    &lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memory_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_service&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Run a Test Loop
&lt;/span&gt;    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-tester&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- TEST SCENARIO ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1. Start a session, tell the agent your preference (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;write in rhymes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2. Type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to start a FRESH session (local state wiped).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3. Ask for a blog post. The agent should retrieve your preference from the CLOUD memory.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;current_session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_session_id&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Chat Session (ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current_session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_session_id&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Fresh Session Started (ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(Local history is empty, retrieval must come from Memory Bank)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent is thinking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
        &lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_function_calls&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_function_calls&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🛠️ Tool Call: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Running the Test
&lt;/h3&gt;

&lt;p&gt;First, ensure you have your Application Default Credentials set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run test_local.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Test Scenario
&lt;/h2&gt;

&lt;p&gt;This scenario validates the full end-to-end lifecycle of the agent: from discovery and research to multimodal content creation and long-term memory retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Teaching &amp;amp; Multimodal Creation (Session 1)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Goal: Establish technical context and set a specific stylistic preference.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Discovery
&lt;/h4&gt;

&lt;p&gt;Ask the agent to find trending Cloud Run topics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;"Find high-engagement questions about AI agents on Cloud Run from the last 21 days."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfyh7wc97yzgc413xhwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfyh7wc97yzgc413xhwi.png" alt="Test 1 - Discovery" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs895zdl1k3z309q3ya5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhs895zdl1k3z309q3ya5.png" alt="Test 2 - Discovery Results" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Research
&lt;/h4&gt;

&lt;p&gt;Instruct the agent to perform a deep dive on a specific result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;"Use the GCP Expert to research topic #1."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3tdx62hxuv8lsumw07k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3tdx62hxuv8lsumw07k.png" alt="Test 3 - Research" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Personalization
&lt;/h4&gt;

&lt;p&gt;Request a blog post and explicitly set your style preference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;: &lt;code&gt;"Draft a blog post based on this research. From now on, I want all my technical blogs written in the style of a 90s Rap Song."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febf7ljsso38maqnetfxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febf7ljsso38maqnetfxt.png" alt="Test 4 - Personalization" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Image Generation
&lt;/h4&gt;

&lt;p&gt;Ask the agent to generate an image that illustrates the main ideas in the blog post using the Nano Banana Pro tool. The image is saved to your Cloud Storage bucket, and the agent returns a URL where you can view it, which will look like: &lt;code&gt;https://storage.mtls.cloud.google.com/...&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjczvc5cwzymc1qjyuou2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjczvc5cwzymc1qjyuou2.png" alt="Token Optimization / Image Generation" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;
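&lt;p&gt;If you prefer to download the generated image rather than view it in the browser, the returned URL can be mapped to a &lt;code&gt;gs://&lt;/code&gt; URI. The sketch below assumes the standard &lt;code&gt;https://storage.mtls.cloud.google.com/BUCKET/OBJECT&lt;/code&gt; path layout; the bucket and object names are hypothetical placeholders:&lt;/p&gt;

```python
from urllib.parse import urlparse, unquote

def console_url_to_gs_uri(url: str) -> str:
    """Map an authenticated Cloud Storage browser URL to a gs:// URI.

    Assumes the standard https://storage.mtls.cloud.google.com/BUCKET/OBJECT
    path layout (the same structure used by storage.cloud.google.com).
    """
    path = unquote(urlparse(url).path).lstrip("/")
    bucket, _, obj = path.partition("/")
    return f"gs://{bucket}/{obj}"

# Hypothetical bucket and object names for illustration:
uri = console_url_to_gs_uri(
    "https://storage.mtls.cloud.google.com/my-dev-signal-bucket/images/blog_hero.png"
)
print(uri)  # gs://my-dev-signal-bucket/images/blog_hero.png
```

&lt;p&gt;The resulting URI can then be passed to &lt;code&gt;gcloud storage cp&lt;/code&gt; to copy the image locally.&lt;/p&gt;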

&lt;h3&gt;
  
  
  Phase 2: Long-Term Memory Recall (Session 2)
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Goal: Verify the agent recalls preferences across a completely fresh session.&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Type &lt;code&gt;new&lt;/code&gt; in the console to wipe local session history and start a fresh state.&lt;/li&gt;
&lt;li&gt;Retrieval: Inquire about your stored preferences to test the Vertex AI memory bank.&lt;br&gt;
&lt;em&gt;Input&lt;/em&gt;: &lt;code&gt;"What are my current topics of interest and what is my preferred blogging style?"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Verification: Confirm the agent successfully retrieves your "AI Agents on Cloud Run" interest and "Rap" style from the cloud.&lt;/li&gt;
&lt;/ol&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsymbvlaad5owzrjc537x.png" alt="Test 5 - Memory Recall" width="800" height="269"&gt;

&lt;p&gt;&lt;strong&gt;Final Test&lt;/strong&gt;: Ask for a new blog on a different topic (e.g., "GKE Autopilot") and ensure it is automatically written as a rap song without being prompted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this part of our series, we focused on verifying the agent's functionality in a local environment before proceeding to cloud deployment. By configuring local secrets and utilizing environment-aware utilities, we used a dedicated test runner to confirm that the core reasoning and tool logic are properly integrated. We successfully validated the full lifecycle, from Reddit discovery to expert content creation, and confirmed that the agent correctly retrieves preferences from the cloud-based Vertex AI memory bank even in completely fresh sessions.&lt;/p&gt;

&lt;p&gt;Ready to run the test scenario yourself? Clone the &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and try the &lt;code&gt;test_local.py&lt;/code&gt; script to see 'Dev Signal' retrieve your preferences from the Vertex AI memory bank in real time. For a deeper dive into the underlying mechanics of memory orchestration, check out this &lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/quickstart-adk" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/googleai/deploying-a-multi-agent-system-with-terraform-and-cloud-run-2a9c"&gt;In the final part of this series,&lt;/a&gt; we will transition our prototype into a production service on Google Cloud Run using Terraform for secure infrastructure, and explore the roadmap to production excellence through continuous evaluation and security.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Special thanks to &lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener noreferrer"&gt;Remigiusz Samborski&lt;/a&gt; for the helpful review and feedback on this article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For more content like this, follow me on &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://x.com/shirmeir86" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>agents</category>
      <category>python</category>
    </item>
    <item>
      <title>Architect A Personalized Multi-Agent System with Long-Term Memory</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 07 May 2026 21:01:06 +0000</pubDate>
      <link>https://dev.to/googleai/architect-a-personalized-multi-agent-system-with-long-term-memory-3o15</link>
      <guid>https://dev.to/googleai/architect-a-personalized-multi-agent-system-with-long-term-memory-3o15</guid>
      <description>&lt;p&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built &lt;strong&gt;Dev Signal&lt;/strong&gt; — a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/googleai/building-capabilities-for-a-multi-agent-system-with-google-adk-mcp-and-cloud-run-ab9"&gt;first part&lt;/a&gt; of this series on &lt;strong&gt;Dev Signal&lt;/strong&gt;, we laid the essential groundwork for the system by establishing a project environment and equipping it with core capabilities through the Model Context Protocol (MCP). We standardized our external integrations, connecting to Reddit for trend discovery, Google Cloud Docs for technical grounding, and building a custom Nano Banana Pro MCP server for multimodal image generation. If you missed &lt;a href="https://dev.to/googleai/building-capabilities-for-a-multi-agent-system-with-google-adk-mcp-and-cloud-run-ab9"&gt;Part 1&lt;/a&gt; or want to explore the code directly, you can find the complete project implementation in our &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, in Part 2, we focus on building the multi-agent architecture and integrating the &lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview" rel="noopener noreferrer"&gt;Vertex AI memory bank&lt;/a&gt; to personalize these capabilities. We will implement a Root Orchestrator that manages three specialist agents (the Reddit Scanner, GCP Expert, and Blog Drafter) to provide a seamless flow from trend discovery to expert content creation. We will also integrate a long-term memory layer that enables the agent to learn from your feedback and persist your stylistic preferences across different conversations. This ensures that Dev Signal doesn't just process data, but actually learns to match your professional voice over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure and Model Setup
&lt;/h2&gt;

&lt;p&gt;First, we initialize the environment and the shared Gemini model.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/agent.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.apps&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;App&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Gemini&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_memory_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preload_memory_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools.tool_context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dev_signal_agent.app_utils.env&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_environment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dev_signal_agent.tools.mcp_config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;get_reddit_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;get_dk_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;get_nano_banana_mcp_toolset&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MODEL_LOC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SERVICE_LOC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SECRETS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;init_environment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;shared_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_LOC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retry_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpRetryOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Memory Ingestion Logic
&lt;/h2&gt;

&lt;p&gt;We want Dev Signal to do more than just follow instructions — we want it to learn from you. By capturing your preferences, such as specific technical interests on Reddit or a preferred blogging style, the agent can personalize its output for future use. To achieve this, we use the &lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview" rel="noopener noreferrer"&gt;&lt;strong&gt;Vertex AI memory bank&lt;/strong&gt;&lt;/a&gt; to persist session history across different conversations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-term Memory
&lt;/h3&gt;

&lt;p&gt;We automate this through the &lt;code&gt;save_session_to_memory_callback&lt;/code&gt; function. This callback is configured to run automatically after every turn, ensuring that session details are captured and stored in the memory bank without manual intervention.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Managed Memory Works:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt;: The &lt;code&gt;save_session_to_memory_callback&lt;/code&gt; sends the conversation data to Vertex AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: Vertex AI converts the text into numerical vectors (embeddings) that capture the semantic meaning of your preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: These vectors are stored in a managed index, enabling the agent to perform semantic searches and retrieve relevant history in future sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: The agent recalls this history using built-in ADK tools. The PreloadMemoryTool proactively brings in context at the start of an interaction, while the LoadMemoryTool allows the agent to fetch specific memories on an as-needed basis.&lt;/li&gt;
&lt;/ul&gt;
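
&lt;p&gt;The storage and retrieval steps above boil down to vector similarity rather than keyword matching. The toy sketch below uses hand-made 3-dimensional vectors in place of real Vertex AI embeddings to show the mechanic; the memory texts and numbers are illustrative only:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "memory bank": each memory is stored alongside its embedding vector.
memory_bank = [
    ("User prefers blogs written as 90s rap songs", [0.9, 0.1, 0.2]),
    ("User is interested in AI agents on Cloud Run", [0.2, 0.9, 0.1]),
    ("User asked about GKE Autopilot pricing", [0.1, 0.3, 0.9]),
]

def retrieve(query_embedding, top_k=1):
    # Rank stored memories by semantic closeness to the query embedding.
    ranked = sorted(
        memory_bank,
        key=lambda m: cosine_similarity(query_embedding, m[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

# A query like "what's my writing style?" would embed close to the first memory:
print(retrieve([0.85, 0.15, 0.25]))  # ['User prefers blogs written as 90s rap songs']
```

&lt;p&gt;In the managed service, the embedding model and index are handled for you; this is only the intuition behind why a fresh session can still find "blogging style" memories without exact keyword overlap.&lt;/p&gt;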

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/agent.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_session_to_memory_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Defensive callback to persist session history to the Vertex AI memory bank.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;callback_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Check connection to Memory Service
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_invocation_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_invocation_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Save the session!
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_invocation_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_session_to_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_invocation_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
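
&lt;p&gt;The defensive &lt;code&gt;*args, **kwargs&lt;/code&gt; signature lets the same callback work whether the context is passed positionally or by keyword, and the &lt;code&gt;hasattr&lt;/code&gt; guard makes it a no-op when no memory service is configured. The sketch below exercises all three paths with simple test doubles; the &lt;code&gt;Fake*&lt;/code&gt; classes stand in for ADK types and Vertex AI, they are not real APIs:&lt;/p&gt;

```python
import asyncio

class FakeMemoryService:
    """Test double that records sessions instead of calling Vertex AI."""
    def __init__(self):
        self.saved = []

    async def add_session_to_memory(self, session):
        self.saved.append(session)

class FakeInvocationContext:
    def __init__(self, memory_service, session):
        self.memory_service = memory_service
        self.session = session

class FakeCallbackContext:
    def __init__(self, invocation_context):
        self._invocation_context = invocation_context

# Same extraction logic as save_session_to_memory_callback above:
async def save_session_to_memory_callback(*args, **kwargs) -> None:
    ctx = kwargs.get("callback_context") or (args[0] if args else None)
    if ctx and hasattr(ctx, "_invocation_context") and ctx._invocation_context.memory_service:
        await ctx._invocation_context.memory_service.add_session_to_memory(
            ctx._invocation_context.session
        )

memory = FakeMemoryService()
ctx = FakeCallbackContext(FakeInvocationContext(memory, session="session-1"))

asyncio.run(save_session_to_memory_callback(ctx))                   # positional context
asyncio.run(save_session_to_memory_callback(callback_context=ctx))  # keyword context
asyncio.run(save_session_to_memory_callback())                      # no context: safely skipped
print(memory.saved)  # ['session-1', 'session-1']
```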



&lt;h3&gt;
  
  
  Short-term Memory
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;add_info_to_state&lt;/code&gt; function serves as the agent's short-term working memory, allowing the &lt;code&gt;gcp_expert&lt;/code&gt; to reliably hand off its detailed findings to the &lt;code&gt;blog_drafter&lt;/code&gt; within the same session. This working memory and the conversation transcript are managed by the Vertex AI Session Service to ensure that active context survives server restarts or transient failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The boundary between session-based state and long-term persistence&lt;/strong&gt; — It is important to note that while this service provides stability during an active interaction, this short-term memory does not persist between different sessions. Starting a fresh session ID effectively resets this working state, ensuring a clean slate for new tasks. Cross-session continuity, where the agent remembers your stylistic preferences or past feedback, is handled by the Vertex AI Memory Bank.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/agent.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_info_to_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to state.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
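
&lt;p&gt;To see the hand-off in action, the sketch below drives &lt;code&gt;add_info_to_state&lt;/code&gt; with a stand-in for ADK's &lt;code&gt;ToolContext&lt;/code&gt; (a plain object exposing a &lt;code&gt;state&lt;/code&gt; mapping, which is all the function touches); the key name and research text are illustrative:&lt;/p&gt;

```python
from types import SimpleNamespace

def add_info_to_state(tool_context, key: str, data: str) -> dict:
    # Same behavior as the tool above: write to the shared session state.
    tool_context.state[key] = data
    return {"status": "success", "message": f"Saved '{key}' to state."}

# Stand-in for ADK's ToolContext: only the `state` mapping is used here.
ctx = SimpleNamespace(state={})

# The gcp_expert writes its findings...
result = add_info_to_state(ctx, "research_notes", "Cloud Run supports GPU-backed agents.")
print(result["message"])  # Saved 'research_notes' to state.

# ...and the blog_drafter reads them back later in the same session.
print(ctx.state["research_notes"])  # Cloud Run supports GPU-backed agents.
```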



&lt;h2&gt;
  
  
  Specialist 1: Reddit Scanner (Discovery)
&lt;/h2&gt;

&lt;p&gt;The Reddit Scanner is our "Trend Spotter": it identifies high-engagement questions from the last 21 days (3 weeks) to ensure that all research findings remain both timely and relevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Usage:&lt;/strong&gt; It leverages &lt;code&gt;load_memory&lt;/code&gt; to retrieve your past areas of interest and preferred topics from the Vertex AI memory bank. If relevant history exists, the agent prioritizes those specific topics in its search to provide a personalized discovery experience.&lt;/p&gt;

&lt;p&gt;Beyond simple retrieval, each sub-agent actively updates its memories by listening for new preferences and explicitly acknowledging them during the chat. This process captures relevant information in the session history, where an automated callback then persists it to the long-term Vertex AI memory bank for future use.&lt;/p&gt;

&lt;p&gt;This memory management is supported by two distinct retrieval patterns within the Google Agent Development Kit (ADK). The first is the &lt;code&gt;PreloadMemoryTool&lt;/code&gt;, which proactively brings in historical context at the beginning of every interaction to ensure the agent is fully briefed before addressing the current request. The second is the &lt;code&gt;LoadMemoryTool&lt;/code&gt;, which the agent uses on an as-needed basis, calling upon it only when it decides that deeper past knowledge would be beneficial for the current step in the workflow.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/agent.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Singleton toolsets
&lt;/span&gt;&lt;span class="n"&gt;reddit_mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_reddit_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_USER_AGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;reddit_scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit_scanner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shared_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a Reddit research specialist. Your goal is to identify high-engagement questions
from the last 3 weeks on specific topics of interest, such as AI/agents on Cloud Run.

Follow these steps:
1. **MEMORY CHECK**: Use `load_memory` to retrieve the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s **past areas of interest** and **preferred topics**. Calibrate your search to align with these interests.
2. Use the Reddit MCP tools to search for relevant subreddits and posts.
3. Filter results for posts created within the last 21 days (3 weeks).
4. Analyze &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high-engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; based on upvote counts and the number of comments.
5. Recommend the most important and relevant questions for a technical audience.
6. **CRITICAL**: For each recommended question, provide a direct link to the original thread and a concise summary of the discussion.
7. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reddit_mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_memory_tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LoadMemoryTool&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="n"&gt;after_agent_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;save_session_to_memory_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Specialist 2: GCP Expert (Grounding)
&lt;/h2&gt;

&lt;p&gt;The GCP Expert is our "Technical Authority". It triangulates facts by synthesizing official documentation from the Google Cloud Developer Knowledge MCP Server, community sentiment from Reddit, and broader context from Google Search.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/agent.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dk_mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_dk_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECRETS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DK_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;search_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shared_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Execute Google Searches and return raw, structured results (Title, Link, Snippet).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;google_search&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gcp_expert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gcp_expert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shared_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a Google Cloud Platform (GCP) documentation expert.
Your goal is to provide accurate, detailed, and cited answers to technical questions by synthesizing official documentation with community insights.

For EVERY technical question, you MUST perform a comprehensive research sweep using ALL available tools:
1. **Official Docs (Grounding)**: Use DeveloperKnowledge MCP (`search_documents`) to find the definitive technical facts.
2. **Social Media Research (Reddit)**: Use the Reddit MCP to research the question on social media. This allows you to find real-world user discussions, common pain points, or alternative solutions that might not be in official documentation.
3. **Broader Context (Web/Social)**: Use the `search_agent` tool to find recent technical blogs, social media discussions, or tutorials.

Synthesize your answer:
- Start with the official answer based on GCP docs.
- Add &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Social Media Insights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Common Issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; sections derived from Reddit and Web Search findings.
- **CRITICAL**: After providing your answer, you MUST use the `add_info_to_state` tool to save your full technical response under the key: `technical_research_findings`.
- Cite your sources specifically at the end of your response, providing **direct links** (URLs) to the official documentation, blog posts, and Reddit threads used.
- **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dk_mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;AgentTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_agent&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;reddit_mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_info_to_state&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;after_agent_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;save_session_to_memory_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
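&lt;p&gt;The instruction above requires the expert to save its findings with &lt;code&gt;add_info_to_state&lt;/code&gt;, a custom tool defined elsewhere in the repo. Conceptually, an ADK tool can be a plain Python function that writes to session state, which the &lt;code&gt;blog_drafter&lt;/code&gt; later reads via &lt;code&gt;{{ technical_research_findings }}&lt;/code&gt;. The sketch below is illustrative only: the stand-in context class replaces ADK's real &lt;code&gt;ToolContext&lt;/code&gt;, and the exact signature in the repo may differ.&lt;/p&gt;

```python
# Hedged sketch of the add_info_to_state tool. FakeToolContext is a
# stand-in for ADK's ToolContext, whose .state attribute behaves like
# a dict scoped to the current session.

class FakeToolContext:
    def __init__(self):
        self.state = {}

def add_info_to_state(key, value, tool_context):
    """Save research under a well-known key so a later agent in the
    same session (here, blog_drafter) can read it."""
    tool_context.state[key] = value
    return {"status": "saved", "key": key}

ctx = FakeToolContext()
result = add_info_to_state(
    "technical_research_findings",
    "Summary of official docs plus Reddit insights.",
    ctx,
)
print(result)  # {'status': 'saved', 'key': 'technical_research_findings'}
```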



&lt;h2&gt;
  
  
  Specialist 3: Blog Drafter (Creativity)
&lt;/h2&gt;

&lt;p&gt;The Blog Drafter is our Content Creator. It drafts the blog based on the expert's findings and offers to generate visuals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Usage:&lt;/strong&gt; It checks &lt;code&gt;load_memory&lt;/code&gt; for the user's &lt;strong&gt;preferred writing style&lt;/strong&gt; (e.g. "Witty", "Rap") stored in the &lt;strong&gt;Vertex AI memory bank&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Paste this code into &lt;code&gt;dev_signal_agent/agent.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nano_mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_nano_banana_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;blog_drafter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blog_drafter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shared_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a professional technical blogger specializing in Google Cloud Platform.
Your goal is to draft high-quality blog posts based on technical research provided by the GCP expert and reliable documentation.

You have access to the research findings from the gcp_expert agent here:
{{ technical_research_findings }}

Follow these steps:
1. **MEMORY CHECK**: Use `load_memory` to retrieve past blog posts, **areas of interest**, and user feedback on writing style. Adopt the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s preferred style and depth.
2. **REVIEW &amp;amp; GROUND**: Review the technical research findings provided above. **CRITICAL**: Use the `dk_mcp` (Developer Knowledge) tool to verify key facts, technical limitations, and API details. Ensure every claim in your blog is grounded in official documentation.
3. Draft a blog post that is engaging, accurate, and helpful for a technical audience.
4. Include code snippets or architectural diagrams if relevant.
5. Provide a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; section with links to the official documentation used.
6. Ensure the tone is professional yet accessible, while adhering to any style preferences found in memory.
7. **VISUALS**: After presenting the drafted blog post, explicitly ask the user: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Would you like me to generate an infographic-style header image to illustrate these key points?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; If they agree, use the `generate_image` tool (Nano Banana).
8. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dk_mcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_memory_tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LoadMemoryTool&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nano_mcp&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;after_agent_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;save_session_to_memory_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
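&lt;p&gt;Both specialists register &lt;code&gt;after_agent_callback=save_session_to_memory_callback&lt;/code&gt;. That callback is defined elsewhere in the repo; conceptually, once an agent finishes its turn, the completed session is handed to the configured memory service so the Vertex AI memory bank can distill long-term facts from it. Here is a minimal sketch of that flow, using stand-in classes rather than ADK's real callback context and memory service (the attribute names are assumptions):&lt;/p&gt;

```python
# Hedged sketch of save_session_to_memory_callback (defined elsewhere
# in the repo). After an agent's turn, the session is handed to the
# memory service, which in production is backed by the Vertex AI
# memory bank. The classes below are stand-ins; ADK's real
# CallbackContext and memory-service APIs may use different names.
import asyncio

class FakeMemoryService:
    def __init__(self):
        self.saved = []

    async def add_session_to_memory(self, session):
        # Production: ship the session to Vertex AI for fact extraction.
        self.saved.append(session)

class FakeCallbackContext:
    def __init__(self, session, memory_service):
        self.session = session
        self.memory_service = memory_service

async def save_session_to_memory_callback(ctx):
    if ctx.memory_service is not None:
        await ctx.memory_service.add_session_to_memory(ctx.session)

memory = FakeMemoryService()
callback_ctx = FakeCallbackContext({"id": "session-1"}, memory)
asyncio.run(save_session_to_memory_callback(callback_ctx))
print(len(memory.saved))  # 1
```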



&lt;h2&gt;
  
  
  The Root Orchestrator
&lt;/h2&gt;

&lt;p&gt;The root agent serves as the system's strategist, managing a team of specialist agents and orchestrating their actions based on the specific goals provided by the user. At the start of a conversation, the orchestrator retrieves memory to establish context by checking for the user's past areas of interest, preferred topics, or previous projects.&lt;/p&gt;

&lt;p&gt;Paste this code into &lt;code&gt;dev_signal_agent/agent.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root_orchestrator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shared_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a technical content strategist. You manage three specialists:
1. reddit_scanner: Finds trending questions and high-engagement topics on Reddit.
2. gcp_expert: Provides technical answers based on official GCP documentation.
3. blog_drafter: Writes professional blog posts based on technical research.

Your responsibilities:
- **MEMORY CHECK**: At the start of a conversation, use `load_memory` to check if the user has specific **areas of interest**, preferred topics, or past projects. Tailor your suggestions accordingly.
- **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.
- If the user wants to find trending topics or questions from Reddit, delegate to reddit_scanner.
- If the user has a technical question or wants to research a specific theme, delegate to gcp_expert.
- **CRITICAL**: After the gcp_expert provides an answer, you MUST ask the user:
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Would you like me to draft a technical blog post based on this answer?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
- If the user agrees or asks to write a blog, delegate to blog_drafter.
- Be proactive in helping the user navigate from discovery (Reddit) to research (Docs) to content creation (Blog).
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;load_memory_tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LoadMemoryTool&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;preload_memory_tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PreloadMemoryTool&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="n"&gt;after_agent_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;save_session_to_memory_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reddit_scanner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gcp_expert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blog_drafter&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev_signal_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this part of our series, we built a multi-agent architecture and implemented a robust, dual-layered memory system. We established a Root Orchestrator managing three specialist agents: a Reddit Scanner for trend discovery, a GCP Expert for technical grounding, and a Blog Drafter for content creation. &lt;/p&gt;

&lt;p&gt;By utilizing short-term state to pass information reliably between specialists and integrating the Vertex AI memory bank for long-term persistence, we've enabled the agent to learn from your feedback and remember specific writing styles across different conversations.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/googleai/local-testing-of-a-multi-agent-system-with-memory-37mm"&gt;Part 3&lt;/a&gt;, we will show you how to test the agent locally to verify these components on your workstation, before transitioning to a full production deployment on Google Cloud Run in Part 4. Can't wait for part 3? The full implementation is already available for you to explore on &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about the underlying technology, explore the &lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview" rel="noopener noreferrer"&gt;Vertex AI Memory Bank overview&lt;/a&gt; or dive into the official &lt;a href="https://docs.cloud.google.com/agent-builder/agent-development-kit/overview" rel="noopener noreferrer"&gt;ADK Documentation&lt;/a&gt; to see how to orchestrate complex multi-agent workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Special thanks to &lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener noreferrer"&gt;Remigiusz Samborski&lt;/a&gt; for the helpful review and feedback on this article.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For more content like this, follow me on &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud Run</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Thu, 07 May 2026 21:00:10 +0000</pubDate>
      <link>https://dev.to/googleai/building-capabilities-for-a-multi-agent-system-with-google-adk-mcp-and-cloud-run-ab9</link>
      <guid>https://dev.to/googleai/building-capabilities-for-a-multi-agent-system-with-google-adk-mcp-and-cloud-run-ab9</guid>
      <description>&lt;p&gt;My team's mission is to accelerate the developer journey from writing code to running secure AI workloads on Google Cloud. To help developers succeed, we focus on identifying their most pressing questions and building demos that provide straightforward, easy-to-implement solutions.&lt;/p&gt;

&lt;p&gt;Recently, I was struck with inspiration when the new &lt;a href="https://developers.google.com/knowledge/mcp" rel="noopener noreferrer"&gt;Developer Knowledge MCP server&lt;/a&gt; was released. It led me to build &lt;strong&gt;Dev Signal&lt;/strong&gt;—a multi-agent system designed with &lt;a href="https://github.com/google/adk-python" rel="noopener noreferrer"&gt;Google Agent Development Kit (ADK)&lt;/a&gt;—to identify technical questions from Reddit, research them using official documentation, and draft detailed technical blogs. &lt;strong&gt;Dev Signal&lt;/strong&gt; also provides custom visuals using &lt;a href="https://blog.google/innovation-and-ai/products/nano-banana-pro/" rel="noopener noreferrer"&gt;Nano Banana Pro&lt;/a&gt;. I even integrated a long-term &lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview" rel="noopener noreferrer"&gt;memory&lt;/a&gt; layer so the agent remembers my specific preferences and blogging style.&lt;/p&gt;

&lt;p&gt;By connecting my coding assistant, &lt;a href="https://docs.cloud.google.com/gemini/docs/codeassist/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, to the developer knowledge MCP server, I built and deployed this entire system to &lt;a href="https://cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Google Cloud Run&lt;/a&gt; in just two days.&lt;/p&gt;

&lt;p&gt;Whether you want to learn how to architect a complex multi-agent system with long-term memory, leverage local and remote MCP servers for tool standardization, or write detailed Terraform scripts for secure Cloud Run deployment, I'll show you how!&lt;/p&gt;

&lt;p&gt;If you'd rather dive straight into the code and explore it at your own pace, you can clone the repository &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/abZxJiXGrJs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'll learn
&lt;/h2&gt;

&lt;p&gt;In this four-part blog series, I'll walk you through the step-by-step process of how I brought this project to life. Each blog post captures the journey of building and deploying &lt;strong&gt;Dev Signal&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: Tools for building agent capabilities&lt;/strong&gt; – You'll begin by setting up your project environment and equipping your agent with tools using the Model Context Protocol (MCP). You'll learn how to connect to Reddit for trend discovery, Google Cloud docs for technical grounding, and a custom Nano Banana Pro tool for image generation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/googleai/architect-a-personalized-multi-agent-system-with-long-term-memory-3o15"&gt;&lt;strong&gt;Part 2: The Multi-Agent Architecture with long term memory&lt;/strong&gt;&lt;/a&gt; – You'll build the "brain" of the system by implementing a root orchestrator and a team of specialized agents. You'll also integrate the &lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview" rel="noopener noreferrer"&gt;Vertex AI memory bank&lt;/a&gt;, enabling the agent to learn and persist your preferences across sessions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/googleai/local-testing-of-a-multi-agent-system-with-memory-37mm"&gt;&lt;strong&gt;Part 3: Testing the agent Locally&lt;/strong&gt;&lt;/a&gt; – Before moving to the cloud, you'll synchronize the agent's components and verify its performance on your workstation. You'll use a dedicated test runner to simulate the full lifecycle of discovery, research, and multimodal creation, with a special focus on validating long-term memory persistence by connecting your local agent directly to the cloud-based Vertex AI memory bank.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/googleai/deploying-a-multi-agent-system-with-terraform-and-cloud-run-2a9c"&gt;&lt;strong&gt;Part 4: Deployment to Cloud Run and the Path to Production&lt;/strong&gt;&lt;/a&gt; – Finally, you'll deploy your service on Google Cloud Run using Terraform for reproducible infrastructure. You'll also discuss the next steps required for a high quality secure production system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting started with Dev Signal
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dev Signal&lt;/strong&gt; is an intelligent monitoring agent designed to filter noise and create value. It operates as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt;: Scouts Reddit for high-engagement technical questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounding&lt;/strong&gt;: Researches answers using official Google Cloud documentation to ensure accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creation&lt;/strong&gt;: Drafts professional technical blog posts based on its findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal Generation&lt;/strong&gt;: Generates custom infographic headers for those posts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-Term Memory&lt;/strong&gt;: Uses &lt;strong&gt;Vertex AI memory bank&lt;/strong&gt; to remember your feedback across different sessions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin, verify that the following are installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.12+&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uv&lt;/strong&gt; (Python package manager): &lt;code&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/sdk/docs/install" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud SDK&lt;/strong&gt;&lt;/a&gt; (&lt;code&gt;gcloud&lt;/code&gt; CLI) installed and authenticated.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.hashicorp.com/terraform/install" rel="noopener noreferrer"&gt;&lt;strong&gt;Terraform&lt;/strong&gt;&lt;/a&gt; (for infrastructure as code).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.npmjs.com/downloading-and-installing-node-js-and-npm" rel="noopener noreferrer"&gt;&lt;strong&gt;Node.js &amp;amp; npm&lt;/strong&gt;&lt;/a&gt; (required for the Reddit MCP tool).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will also need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href="https://docs.cloud.google.com/resource-manager/docs/creating-managing-projects" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud Project&lt;/strong&gt;&lt;/a&gt; with billing enabled.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api" rel="noopener noreferrer"&gt;&lt;strong&gt;APIs Enabled&lt;/strong&gt;&lt;/a&gt;: Vertex AI, Cloud Run, Secret Manager, Artifact Registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reddit API Credentials&lt;/strong&gt; (Client ID, Secret) - You can get these from the &lt;a href="https://www.reddit.com/prefs/apps" rel="noopener noreferrer"&gt;Reddit Developer Portal&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Knowledge API Key&lt;/strong&gt; (for Google Cloud docs search) - Instructions on how to get it are &lt;a href="https://developers.google.com/knowledge/mcp" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
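&lt;p&gt;For local development, these credentials typically live in the &lt;code&gt;.env&lt;/code&gt; file at the project root. A hypothetical layout is shown below (the Reddit variable names are assumptions; &lt;code&gt;DK_API_KEY&lt;/code&gt; matches the key the agent code reads later):&lt;/p&gt;

```shell
# Hypothetical .env contents. Replace the placeholders with your own
# credentials; variable names other than DK_API_KEY are assumptions.
export GOOGLE_CLOUD_PROJECT="my-project-id"
export GOOGLE_CLOUD_LOCATION="us-central1"
export DK_API_KEY="your-developer-knowledge-api-key"
export REDDIT_CLIENT_ID="your-reddit-client-id"
export REDDIT_CLIENT_SECRET="your-reddit-client-secret"
```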

&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Dev Signal&lt;/strong&gt; system was built by first running the &lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener noreferrer"&gt;Agent Starter Pack&lt;/a&gt;, following the automated architect workflow described in the &lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks" rel="noopener noreferrer"&gt;Agent Factory episode&lt;/a&gt; by &lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener noreferrer"&gt;Remigiusz Samborski&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/vkolesnikov/" rel="noopener noreferrer"&gt;Vlad Kolesnikov&lt;/a&gt;. This foundation provided the project's modular directory structure, which is used to separate concerns between Agent Logic, Server Code, Utilities, and Tools.&lt;/p&gt;

&lt;p&gt;The starter pack acts as a powerful starting point because it automates the creation of professional infrastructure, CI/CD pipelines, and observability tools in seconds. This allows you to focus entirely on the agent's unique intelligence while ensuring the underlying platform remains secure and scalable. By building on top of this generated boilerplate with AI assistance from &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt; and &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt;, the development process is highly accelerated.&lt;/p&gt;

&lt;p&gt;The agent starter pack high level architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F317n2pofuobvchzxorm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F317n2pofuobvchzxorm0.png" alt="Agent Starter Pack Architecture" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Initialize the Project
&lt;/h3&gt;

&lt;p&gt;Create a new directory for your project and initialize it. We'll use &lt;code&gt;uv&lt;/code&gt;, which is an extremely fast Python package manager.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init dev-signal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Folder Structure
&lt;/h3&gt;

&lt;p&gt;Our project will follow this structure. We will populate these files step-by-step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dev-signal/
├── dev_signal_agent/
│   ├── __init__.py
│   ├── agent.py                # Agent logic &amp;amp; orchestration
│   ├── fast_api_app.py         # Application server &amp;amp; memory connection
│   ├── app_utils/              # Env Config
│   │   └── env.py
│   └── tools/                  # External capabilities
│       ├── __init__.py
│       ├── mcp_config.py       # Tool configuration (Reddit, Docs)
│       └── nano_banana_mcp/    # Custom local image generation tool
│           ├── __init__.py
│           ├── main.py
│           ├── nano_banana_pro.py
│           ├── media_models.py
│           ├── storage_utils.py
│           └── requirements.txt
├── deployment/
│   └── terraform/              # Infrastructure as Code
├── .env                        # Local secrets (API keys)
├── Makefile                    # Shortcuts for building/deploying
├── Dockerfile                  # Container definition
└── pyproject.toml              # Dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Define Dependencies
&lt;/h3&gt;

&lt;p&gt;Update your &lt;code&gt;pyproject.toml&lt;/code&gt; with the necessary dependencies. We use &lt;code&gt;google-adk&lt;/code&gt; for the agent framework and &lt;code&gt;google-genai&lt;/code&gt; for the model interaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"dev-signal"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1.0"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"A multi-agent system for monitoring and content creation."&lt;/span&gt;
&lt;span class="py"&gt;readme&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"README.md"&lt;/span&gt;
&lt;span class="py"&gt;requires-python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;"&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;3.14&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="py"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="py"&gt;"google-adk&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"google-genai&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"mcp&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"python-dotenv&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"fastapi&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.110&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"uvicorn&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.29&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"google-cloud-logging&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"google-cloud-aiplatform&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.38&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"fastmcp&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.13&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"google-cloud-storage&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.6&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"google-auth&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;    &lt;span class="py"&gt;"google-cloud-secret-manager&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.26&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;uv sync&lt;/code&gt; to install everything.&lt;/p&gt;

&lt;p&gt;Create a new directory for the agent code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;dev_signal_agent
&lt;span class="nb"&gt;cd &lt;/span&gt;dev_signal_agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building the agent capabilities: MCP tools
&lt;/h2&gt;

&lt;p&gt;Our agent needs to interact with the outside world. We use the &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;/a&gt;, a universal standard for connecting AI agents to external data and tools, to standardize this. Instead of writing custom API wrappers, we use standard MCP servers, which let us connect to APIs (Reddit), knowledge bases (Google Cloud Docs), and even local scripts (image generation with Nano Banana Pro) through a common interface. Create a new directory for the agent tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;tools
&lt;span class="nb"&gt;cd &lt;/span&gt;tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tools Configuration
&lt;/h3&gt;

&lt;p&gt;We'll define our toolsets in &lt;code&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This file defines the connection parameters for our three main tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reddit&lt;/strong&gt;: Connected via a local stdio subprocess.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer Knowledge&lt;/strong&gt;: Connected via a remote HTTP endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nano Banana&lt;/strong&gt;: Connected via a local stdio subprocess (our custom Python script).&lt;/li&gt;
&lt;/ul&gt;
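&lt;p&gt;To make the common-interface idea concrete, here is a minimal standalone sketch of how these three toolsets might later be handed to a single agent. The &lt;code&gt;Agent&lt;/code&gt; class below is a stub standing in for the Google ADK's agent class, and the &lt;code&gt;get_*_toolset&lt;/code&gt; factories return placeholder strings rather than the real &lt;code&gt;McpToolset&lt;/code&gt; objects we define next, so the example runs on its own:&lt;/p&gt;

```python
# Sketch only: Agent is a stub standing in for the Google ADK agent class,
# and the toolset factories return placeholder strings instead of real
# McpToolset objects, so the example runs standalone.
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    tools: list = field(default_factory=list)


def get_reddit_mcp_toolset():       # local stdio subprocess in the real code
    return "reddit-toolset"


def get_dk_mcp_toolset():           # remote HTTP endpoint in the real code
    return "developer-knowledge-toolset"


def get_nano_banana_mcp_toolset():  # local stdio subprocess in the real code
    return "nano-banana-toolset"


# Regardless of transport, every toolset plugs into the same tools list.
agent = Agent(
    name="dev_signal_agent",
    tools=[
        get_reddit_mcp_toolset(),
        get_dk_mcp_toolset(),
        get_nano_banana_mcp_toolset(),
    ],
)
print(len(agent.tools))  # -> 3
```

&lt;p&gt;The transport differences (stdio vs. HTTP) live entirely inside each factory; the agent never needs to know how a tool is reached.&lt;/p&gt;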

&lt;h3&gt;
  
  
  Reddit Search (Discovery Tool)
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/Arindam200/reddit-mcp" rel="noopener noreferrer"&gt;Reddit MCP server&lt;/a&gt; acts as a bridge to the Reddit API, allowing your agent to discover trending posts and analyze engagement without you having to write complex API wrappers. To ensure portability, the code uses a "find or fetch" strategy: it first checks for a local installation and, if missing, automatically uses &lt;code&gt;npx&lt;/code&gt; to download and run the server on demand.&lt;/p&gt;

&lt;p&gt;Instead of a network connection, the agent launches the server as a local subprocess and communicates via standard input and output (stdio). Within the Google ADK, the &lt;code&gt;McpToolset&lt;/code&gt; class acts as a universal wrapper that standardizes these connections, enabling your agent to interact with various tools, from community resources to custom scripts like the Nano Banana image generator, using a common interface. By securely passing API credentials through environment variables, the system ensures these "plug-and-play" modules function as a seamless bridge between the AI and external platforms.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StdioServerParameters&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;McpToolset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.tools.mcp_tool&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamableHTTPConnectionParams&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StdioConnectionParams&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_reddit_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Connects to the Reddit MCP server.
    This server runs as a local subprocess (stdio) and proxies requests to the Reddit API.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Check if 'reddit-mcp' is installed globally, otherwise use npx to run it
&lt;/span&gt;    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;which&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;which&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--quiet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reddit-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Inject secrets into the environment of the subprocess only
&lt;/span&gt;    &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DOTENV_CONFIG_SILENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LANG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_US.UTF-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_CLIENT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client_secret&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDDIT_USER_AGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_agent&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;McpToolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StdioConnectionParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;server_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;  &lt;span class="c1"&gt;# Pass injected secrets directly to the subprocess
&lt;/span&gt;            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;120.0&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Google Cloud Docs (Knowledge Tool)
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.google.com/knowledge/mcp" rel="noopener noreferrer"&gt;Developer Knowledge MCP server&lt;/a&gt; provides grounding for your agent by allowing it to search the entire corpus of official Google Cloud documentation. Unlike the local Reddit server, this is a managed service hosted by Google and accessed as a remote endpoint over the internet. It exposes specialized tools like &lt;code&gt;google_developer_documentation_search&lt;/code&gt; for semantic queries and &lt;code&gt;google_developer_documentation_fetch&lt;/code&gt; to retrieve full markdown content, ensuring that every technical claim the agent makes is supported by definitive, up-to-date facts.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can also connect your coding assistant tools such as &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt; or &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt; to the Developer Knowledge MCP server to give them convenient access to up-to-date Google Cloud documentation. I used it while writing this blog!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To connect, the agent uses the &lt;code&gt;McpToolset&lt;/code&gt; class with &lt;code&gt;StreamableHTTPConnectionParams&lt;/code&gt;, pointing to a web URL instead of launching a local process. It authenticates securely using a &lt;code&gt;DK_API_KEY&lt;/code&gt; (&lt;a href="https://developers.google.com/knowledge/mcp" rel="noopener noreferrer"&gt;create your API key&lt;/a&gt;) passed in the request headers, allowing the agent to perform a "comprehensive research sweep" across official docs, community sentiment, and broader web context through a single standardized interface.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_dk_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Connects to Developer Knowledge (Google Cloud Docs).
    This is a remote MCP server accessed via HTTP.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Goog-Api-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Fallback to os.environ for local testing if not passed via API
&lt;/span&gt;        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Goog-Api-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DK_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;McpToolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StreamableHTTPConnectionParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://developerknowledge.googleapis.com/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Image Generator (Nano Banana MCP)
&lt;/h3&gt;

&lt;p&gt;While we've used external MCP servers for Reddit and documentation, we can also build our own custom MCP server to wrap specific Python logic. In this case, we are creating an image generation tool powered by Gemini 3 Pro Image (also known as Nano Banana Pro). This demonstrates that any Python function can be standardized into a tool that any agent can understand.&lt;/p&gt;

&lt;p&gt;How the image generation works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://gofastmcp.com/getting-started/welcome" rel="noopener noreferrer"&gt;&lt;strong&gt;FastMCP&lt;/strong&gt;&lt;/a&gt;: We use the &lt;code&gt;fastmcp&lt;/code&gt; library to drastically simplify server creation, allowing us to register Python functions as tools with just a few lines of code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini Integration&lt;/strong&gt;: The server uses the Google GenAI SDK to call the &lt;code&gt;gemini-3-pro-image-preview&lt;/code&gt; model, which converts the agent's descriptive prompts into raw image bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GCS Upload &amp;amp; Hosting:&lt;/strong&gt; Because agent interfaces typically require a URL to display images, the server automatically uploads the generated bytes to Google Cloud Storage (GCS) and returns a public link.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To connect this local tool, we use &lt;code&gt;StdioConnectionParams&lt;/code&gt; because the server runs as a local subprocess communicating via standard input and output. This transport method directly matches the &lt;code&gt;transport="stdio"&lt;/code&gt; configuration we will define in our server entrypoint, ensuring a seamless connection for your custom local scripts.&lt;/p&gt;

&lt;p&gt;The following code defines the MCP connection in &lt;code&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;. We use &lt;code&gt;uv run&lt;/code&gt; to ensure the server starts in an isolated environment with all its dependencies correctly installed.&lt;/p&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_nano_banana_mcp_toolset&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Connects to our local &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Nano Banana&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; image generator.
    This demonstrates how to wrap a local Python script as an MCP tool.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev_signal_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nano_banana_mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI_ASSETS_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;McpToolset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StdioConnectionParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;server_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI_ASSETS_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;600.0&lt;/span&gt;  &lt;span class="c1"&gt;# Image generation can take time
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implementing the Nano Banana Pro Server Logic
&lt;/h3&gt;

&lt;p&gt;Now, we will implement the actual logic for this server. This implementation is based on the &lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=2" rel="noopener noreferrer"&gt;Agent Factory&lt;/a&gt; demo &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/a9a5f64a3394a4b5ecc64061f397bd5ed82927ee/ai-ml/agent-factory-antigravity-nano-banana-pro/mcp" rel="noopener noreferrer"&gt;code&lt;/a&gt; by Remigiusz Samborski. While Remi's original code provides instructions for deploying the MCP server to Cloud Run, we will run it here as a local subprocess for faster development and testing.&lt;/p&gt;

&lt;p&gt;To get started, create the directory for our new server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; dev_signal_agent/tools/nano_banana_mcp
&lt;span class="nb"&gt;cd &lt;/span&gt;dev_signal_agent/tools/nano_banana_mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Server Entrypoint (&lt;code&gt;main.py&lt;/code&gt;)
&lt;/h4&gt;

&lt;p&gt;This file acts as the "brain" that initializes and starts the MCP server.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastMCP Initialization:&lt;/strong&gt; We use the &lt;code&gt;FastMCP&lt;/code&gt; library to create a server named "MediaGenerators" and register our &lt;code&gt;generate_image&lt;/code&gt; function as a tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safe Logging:&lt;/strong&gt; The &lt;code&gt;_initialize_console_logging&lt;/code&gt; function is critical. It forces all logs to &lt;code&gt;sys.stderr&lt;/code&gt;. This is because the MCP "stdio" transport uses &lt;code&gt;sys.stdout&lt;/code&gt; for communication between the agent and the tool; standard logs sent to &lt;code&gt;stdout&lt;/code&gt; would corrupt that protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: The &lt;code&gt;mcp.run(transport="stdio")&lt;/code&gt; line starts the server as a local subprocess, allowing it to listen for requests from your agent via standard input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/nano_banana_mcp/main.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nano_banana_pro&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_image&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_initialize_console_logging&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Ensure logs go to STDERR so they don't break the MCP stdio protocol
&lt;/span&gt;    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;min_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;generate_image&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MediaGenerators&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;_initialize_console_logging&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stdio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Generation Logic (&lt;code&gt;nano_banana_pro.py&lt;/code&gt;)
&lt;/h4&gt;

&lt;p&gt;This is where the actual image generation happens using Gemini.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GenAI Client:&lt;/strong&gt; We initialize the &lt;code&gt;genai.Client()&lt;/code&gt; to interact with Google's generative models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Selection:&lt;/strong&gt; It specifically targets the &lt;code&gt;gemini-3-pro-image-preview&lt;/code&gt; model. We set the &lt;code&gt;response_modalities&lt;/code&gt; to "IMAGE" to tell the model we want pixels, not just text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robustness&lt;/strong&gt;: The code includes a &lt;code&gt;MAX_RETRIES&lt;/code&gt; loop (set to 5) to handle any transient generation errors, ensuring the agent has multiple attempts to get a valid image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Byte Processing:&lt;/strong&gt; Once the model generates the image, it arrives as raw inline data. We extract these bytes and call our helper to move them to the cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URI Conversion:&lt;/strong&gt; Finally, it replaces the internal &lt;code&gt;gs://&lt;/code&gt; path with a browser-accessible &lt;code&gt;https://&lt;/code&gt; URL so the user can actually see the image.&lt;/li&gt;
&lt;/ul&gt;
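&lt;p&gt;That last step is simple string surgery: swap the &lt;code&gt;gs://&lt;/code&gt; scheme for the public storage host. Here is a hedged sketch of the idea; the &lt;code&gt;to_public_url&lt;/code&gt; helper name is hypothetical and only illustrates the rewrite:&lt;/p&gt;

```python
# Hypothetical helper illustrating the gs:// -> https:// rewrite described
# above; to_public_url is an illustration, not part of the tutorial code.
AUTHORIZED_URI = "https://storage.mtls.cloud.google.com/"


def to_public_url(gs_uri: str) -> str:
    # gs://bucket/path/img.png
    #   -> https://storage.mtls.cloud.google.com/bucket/path/img.png
    return gs_uri.replace("gs://", AUTHORIZED_URI, 1)


print(to_public_url("gs://ai-assets/posts/cover.png"))
# -> https://storage.mtls.cloud.google.com/ai-assets/posts/cover.png
```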

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/nano_banana_mcp/nano_banana_pro.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;media_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MediaAsset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;storage_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;upload_data_to_gcs&lt;/span&gt;

&lt;span class="n"&gt;AUTHORIZED_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://storage.mtls.cloud.google.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aspect_ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16:9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9:16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16:9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MediaAsset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generates an image using Gemini 3 Image model.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;genai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting image generation for prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;asset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MediaAsset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-pro-image-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMAGE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;image_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ImageConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aspect_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;aspect_ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="c1"&gt;# Upload the raw bytes to GCS
&lt;/span&gt;                    &lt;span class="n"&gt;gcs_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;upload_data_to_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp-tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;asset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MediaAsset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gcs_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No image was generated.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Convert gs:// URI to an HTTP accessible URL if needed
&lt;/span&gt;        &lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gs://&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AUTHORIZED_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image URL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;asset&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
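&lt;p&gt;The &lt;code&gt;gs://&lt;/code&gt; to HTTPS rewrite at the end of &lt;code&gt;generate_image&lt;/code&gt; is a plain string replacement. Here is a minimal stdlib-only sketch (the bucket and object names are made up for illustration):&lt;/p&gt;

```python
# Rewrites a gs:// URI into an HTTPS URL that clients can fetch,
# provided the caller has access to the bucket.
AUTHORIZED_URI = "https://storage.mtls.cloud.google.com/"

def to_http_url(gcs_uri: str) -> str:
    # gs://bucket/path becomes https://storage.mtls.cloud.google.com/bucket/path
    return gcs_uri.replace("gs://", AUTHORIZED_URI)

http_uri = to_http_url("gs://my-bucket/assets/mcp-tools/abc123.png")
print(http_uri)
```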



&lt;h4&gt;
  
  
  GCS Upload Helper (&lt;code&gt;storage_utils.py&lt;/code&gt;)
&lt;/h4&gt;

&lt;p&gt;Since agents need a web link to display images, this utility handles the hosting on Google Cloud Storage (GCS).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Bucket Selection&lt;/strong&gt;: It looks for a bucket name in your environment variables, falling back from &lt;code&gt;AI_ASSETS_BUCKET&lt;/code&gt; to &lt;code&gt;LOGS_BUCKET_NAME&lt;/code&gt; to ensure it always has a place to save data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique Filenames:&lt;/strong&gt; We use an MD5 hash of the raw image data to create a unique filename. This prevents filename collisions and acts as a simple way to avoid duplicate uploads of the same image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Upload:&lt;/strong&gt; The &lt;code&gt;blob.upload_from_string&lt;/code&gt; method pushes the raw image bytes directly to your GCS bucket.&lt;/li&gt;
&lt;/ul&gt;
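&lt;p&gt;To see why this naming scheme avoids both collisions and duplicate uploads, the logic from the bullets above can be isolated into a stdlib-only sketch (&lt;code&gt;make_blob_name&lt;/code&gt; is a hypothetical helper used here for illustration):&lt;/p&gt;

```python
import hashlib
import mimetypes

def make_blob_name(agent_id: str, data: bytes, mime_type: str) -> str:
    # The MD5 hex digest of the raw bytes is the filename, so the
    # same image always maps to the same object name.
    file_name = hashlib.md5(data).hexdigest()
    ext = mimetypes.guess_extension(mime_type) or ""
    return f"assets/{agent_id}/{file_name}{ext}"

# Re-uploading identical bytes produces the identical object name,
# so duplicates simply overwrite each other in GCS.
print(make_blob_name("mcp-tools", b"hello", "image/png"))
```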

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/nano_banana_mcp/storage_utils.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mimetypes&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud.storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Blob&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;storage_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ai_bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI_ASSETS_BUCKET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOGS_BUCKET_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ai_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ai_bucket_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_data_to_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mimetypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guess_extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;blob_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assets/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ai_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;blob_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gs://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ai_bucket_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;blob_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Data Model (&lt;code&gt;media_models.py&lt;/code&gt;)
&lt;/h4&gt;

&lt;p&gt;This file ensures that our data follows a strict structure (a schema).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured Output:&lt;/strong&gt; By using a Pydantic &lt;code&gt;BaseModel&lt;/code&gt;, we guarantee that the tool always returns a consistent JSON object containing a &lt;code&gt;uri&lt;/code&gt; (the link) and an optional &lt;code&gt;error&lt;/code&gt; message. This makes it much easier for the AI agent to understand and process the tool's result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/nano_banana_mcp/media_models.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MediaAsset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Tool Dependencies (&lt;code&gt;requirements.txt&lt;/code&gt;)
&lt;/h4&gt;

&lt;p&gt;While we use &lt;code&gt;uv&lt;/code&gt; to run our code, a &lt;code&gt;requirements.txt&lt;/code&gt; file is still essential: it tells &lt;code&gt;uv&lt;/code&gt; exactly which dependencies to install into the isolated environment before the Nano Banana server starts.&lt;/p&gt;

&lt;p&gt;This file lists the three core libraries required for this tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;google-cloud-storage&lt;/strong&gt;: Used for hosting the generated images on the cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;google-genai&lt;/strong&gt;: Provides the logic for the Gemini 3 Pro image generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fastmcp&lt;/strong&gt;: The framework that turns our Python script into a standardized MCP tool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paste this code in &lt;code&gt;dev_signal_agent/tools/nano_banana_mcp/requirements.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;google-cloud-storage=&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.6&lt;/span&gt;&lt;span class="err"&gt;.*&lt;/span&gt;
&lt;span class="py"&gt;google-genai=&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.52&lt;/span&gt;&lt;span class="err"&gt;.*&lt;/span&gt;
&lt;span class="py"&gt;fastmcp=&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.13&lt;/span&gt;&lt;span class="err"&gt;.*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this first part of our series, we focused on establishing the agent's core capabilities by standardizing its external integrations through the Model Context Protocol (MCP). We initialized the project using &lt;code&gt;uv&lt;/code&gt; for high-speed dependency management and successfully configured three critical toolsets: Reddit for trend discovery, Google Cloud Docs for technical grounding, and a custom "Nano Banana" MCP server for multimodal image generation. By utilizing the Google ADK's &lt;code&gt;McpToolset&lt;/code&gt;, we've abstracted away complex API logic into simple, plug-and-play modules, ensuring that our tools share a common interface that decouples integration from intelligence.&lt;/p&gt;

&lt;p&gt;For a deeper look into our technical foundation, you can explore the &lt;a href="https://developers.google.com/knowledge/mcp" rel="noopener noreferrer"&gt;Developer Knowledge MCP server&lt;/a&gt; to learn more about knowledge grounding or visit the &lt;a href="https://github.com/google/adk-python" rel="noopener noreferrer"&gt;Google ADK GitHub repository&lt;/a&gt; to explore the framework's core capabilities.&lt;/p&gt;

&lt;p&gt;With our toolset fully configured and ready for action, we can now move to &lt;a href="https://dev.to/googleai/architect-a-personalized-multi-agent-system-with-long-term-memory-3o15"&gt;Part 2&lt;/a&gt;, where we will build the multi-agent architecture and integrate the Vertex AI memory bank to orchestrate these capabilities. You can also jump ahead to &lt;a href="https://dev.to/googleai/local-testing-of-a-multi-agent-system-with-memory-37mm"&gt;Part 3&lt;/a&gt;, where we will show you how to test the agent locally to verify these components on your workstation. The complete code for the entire series is available in our &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Special thanks to &lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener noreferrer"&gt;Remigiusz Samborski&lt;/a&gt; for the helpful review and feedback on this article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For more content like this, follow Shir on &lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://x.com/shirmeir86" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Multimodal RAG with the Gemini API File Search Tool: A Developer Guide</title>
      <dc:creator>Patrick Loeber</dc:creator>
      <pubDate>Tue, 05 May 2026 17:43:58 +0000</pubDate>
      <link>https://dev.to/googleai/multimodal-rag-with-the-gemini-api-file-search-tool-a-developer-guide-5878</link>
      <guid>https://dev.to/googleai/multimodal-rag-with-the-gemini-api-file-search-tool-a-developer-guide-5878</guid>
      <description>&lt;p&gt;The File Search tool in the Gemini API now supports multimodal retrieval by adding support for &lt;a href="https://developers.googleblog.com/en/building-with-gemini-embedding-2/" rel="noopener noreferrer"&gt;Gemini Embedding 2&lt;/a&gt;. This update allows images, such as charts, product photos, and diagrams, to be natively indexed and searched in the same store as your text-based documents.&lt;/p&gt;

&lt;p&gt;This post covers how to use the File Search tool end-to-end: creating a store, uploading documents and images, querying with grounded generation, and retrieving image citations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is File Search?
&lt;/h2&gt;


&lt;p&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/file-search" rel="noopener noreferrer"&gt;File Search&lt;/a&gt; is the Gemini API's built-in RAG tool. When you upload your documents, the API takes care of the heavy lifting: chunking, embedding, indexing, and retrieval. At query time, pass a &lt;code&gt;file_search&lt;/code&gt; tool alongside your prompt, and the model automatically retrieves relevant chunks from your data to generate a grounded response.&lt;/p&gt;

&lt;p&gt;Compared to rolling your own RAG pipeline, File Search offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully managed&lt;/strong&gt;: No vector databases to provision or embedding pipeline to maintain.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective&lt;/strong&gt;: Storage and query-time embeddings are free. You only pay for the initial indexing embeddings and the standard Gemini input/output tokens.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in citations&lt;/strong&gt;: Every response includes grounding metadata that links the answer to specific documents and pages. For multimodal stores, citations also include downloadable image references.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native image search&lt;/strong&gt;: With the &lt;code&gt;gemini-embedding-2&lt;/code&gt; model, images are embedded directly rather than relying on OCR, enabling true visual retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It in AI Studio
&lt;/h2&gt;

&lt;p&gt;Want to see multimodal File Search in action before writing any code? We built an &lt;a href="https://ai.studio/apps/acb0ca81-7130-43ae-a31f-bedd96d28294" rel="noopener noreferrer"&gt;example app in AI Studio&lt;/a&gt; that lets you chat with your documents and image library. Upload PDFs and images, then ask questions. The app retrieves relevant text and visuals in real time, complete with citations and page numbers so you can trace every answer back to its source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6ycvixl7n5v7xzumpqz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6ycvixl7n5v7xzumpqz.gif" alt=" " width="480" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Create a File Search Store
&lt;/h3&gt;

&lt;p&gt;A File Search Store is a persistent container for your document embeddings. Think of it as a managed vector database scoped to a project.&lt;/p&gt;

&lt;p&gt;To enable multimodal search over images, specify &lt;code&gt;gemini-embedding-2&lt;/code&gt; as the embedding model. This parameter is optional; if omitted, the store defaults to &lt;code&gt;gemini-embedding-001&lt;/code&gt;, which is cost-optimized for text-only workloads. Note that the embedding model is fixed at store creation and cannot be changed later.&lt;/p&gt;

&lt;p&gt;To use the new features, make sure to install the latest Python SDK: &lt;code&gt;pip install -U google-genai&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a multimodal store with gemini-embedding-2
# Omit embedding_model to use the default text-only model (gemini-embedding-001)
&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product-catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;models/gemini-embedding-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created store: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Embedding Model&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;gemini-embedding-001&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Text-heavy workloads, cost-optimized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemini-embedding-2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multimodal retrieval (documents &lt;em&gt;and&lt;/em&gt; images)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: Upload Documents and Images
&lt;/h3&gt;

&lt;p&gt;The simplest path is the &lt;code&gt;upload_to_file_search_store&lt;/code&gt; method, which uploads and indexes a file in one step. With &lt;code&gt;gemini-embedding-2&lt;/code&gt;, this works for both documents and images:&lt;/p&gt;

&lt;p&gt;Note: Audio and video formats are currently not supported.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# Upload a PDF document
&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_to_file_search_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_search_store_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_catalog.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Product Catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for ingestion to complete
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Upload product images directly
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image_file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sneaker_red.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sneaker_blue.jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sneaker_white.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_to_file_search_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;file_search_store_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All files indexed!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, the API chunks documents, generates embeddings, and indexes the content. When using &lt;code&gt;gemini-embedding-2&lt;/code&gt;, images within PDFs are also natively embedded alongside the text.&lt;/p&gt;

&lt;p&gt;You can also &lt;a href="https://ai.google.dev/gemini-api/docs/file-search#importing-files" rel="noopener noreferrer"&gt;import existing files&lt;/a&gt; from the Files API into a store.&lt;/p&gt;
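&lt;p&gt;The polling loop above repeats for every upload, so it's worth wrapping once. The helper below is our own (not part of the SDK); it only assumes the operation shape shown above: a &lt;code&gt;done&lt;/code&gt; flag that &lt;code&gt;client.operations.get&lt;/code&gt; refreshes.&lt;/p&gt;

```python
import time


def wait_for_operation(client, operation, poll_seconds=5):
    """Poll a long-running operation until its `done` flag is set.

    Mirrors the while-loop used for each upload above; returns the
    final operation object once ingestion has completed.
    """
    while not operation.done:
        time.sleep(poll_seconds)
        operation = client.operations.get(operation)
    return operation
```

&lt;p&gt;With this in place, each upload in the loop becomes a single &lt;code&gt;op = wait_for_operation(client, op)&lt;/code&gt; call.&lt;/p&gt;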

&lt;h3&gt;
  
  
  Step 3: Query with File Search
&lt;/h3&gt;

&lt;p&gt;Query your data by passing the &lt;code&gt;file_search&lt;/code&gt; tool to &lt;code&gt;generate_content&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which sneakers come in red?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search_store_names&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system searches the File Search store for the most relevant chunks and uses them to generate a grounded response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Inspect Citations and Retrieve Images
&lt;/h3&gt;

&lt;p&gt;Every File Search response includes grounding metadata — essentially, a bibliography for the model's answer. It captures page numbers for the indexed information, allowing applications to point users directly to the right spot in a document. This is especially useful for rigorous fact-checking over large PDFs.&lt;/p&gt;

&lt;p&gt;With multimodal stores, citations can include a &lt;code&gt;media_id&lt;/code&gt; for referenced images, which can be downloaded directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;grounding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;grounding_metadata&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grounding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grounding_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retrieved_context&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;media_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is an image citation — download it
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cited image: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   Media ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;media_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_media&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;media_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;media_id&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cited_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Text citation with exact page number
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cited text: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   Page: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_number&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# See which parts of the response are grounded in which sources
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;support&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grounding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grounding_supports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claim: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;support&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Grounded in chunks: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;support&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grounding_chunk_indices&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is powerful for building user-facing applications. It's now possible to show users the &lt;em&gt;actual images&lt;/em&gt; the model used in its reasoning, not just a text description.&lt;/p&gt;
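&lt;p&gt;For a UI, the citation loop above can be condensed into a small formatter. This is a sketch of our own (not an SDK helper); it assumes chunk objects shaped like the ones above, with a &lt;code&gt;retrieved_context&lt;/code&gt; carrying &lt;code&gt;media_id&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, and &lt;code&gt;page_number&lt;/code&gt; fields:&lt;/p&gt;

```python
def format_citations(grounding_chunks):
    """Turn grounding chunks into (kind, label) pairs for display.

    Image citations are labeled by title; text citations include the
    page number when the store recorded one.
    """
    entries = []
    for chunk in grounding_chunks:
        ctx = chunk.retrieved_context
        if getattr(ctx, "media_id", None):
            entries.append(("image", ctx.title))
        elif getattr(ctx, "page_number", None):
            entries.append(("text", f"{ctx.title}, p. {ctx.page_number}"))
        else:
            entries.append(("text", ctx.title))
    return entries
```

&lt;p&gt;The resulting pairs can be rendered as thumbnails (downloading each &lt;code&gt;media_id&lt;/code&gt; as shown above) or as footnote-style source links.&lt;/p&gt;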

&lt;h2&gt;
  
  
  Managing Stores
&lt;/h2&gt;

&lt;p&gt;Here's a quick reference for managing stores and documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# List all stores
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List documents in a store
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Delete a specific document
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fileSearchStores/my-store/documents/old_doc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Delete an entire store (force=True also deletes all contained documents)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
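&lt;p&gt;During development it's easy to accumulate stale stores. Here's a cleanup sketch built from the calls above (the helper itself is ours, not part of the SDK); note that deletion with &lt;code&gt;force=True&lt;/code&gt; is irreversible and removes all contained documents:&lt;/p&gt;

```python
def delete_all_stores(client):
    """Delete every File Search store under the project.

    Irreversible: force=True also removes the documents inside
    each store. Returns the names that were deleted.
    """
    deleted = []
    for store in client.file_search_stores.list():
        client.file_search_stores.delete(
            name=store.name,
            config={"force": True},
        )
        deleted.append(store.name)
    return deleted
```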



&lt;h2&gt;
  
  
  Power Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Custom Metadata and Filtering
&lt;/h3&gt;

&lt;p&gt;You can attach metadata to documents at upload time and use it to filter at query time. This is essential when a store contains diverse documents and searches need to be scoped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Upload with metadata
&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_to_file_search_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_search_store_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shoes_collection.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spring 2026 Shoes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;footwear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;season&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spring-2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numeric_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query with a metadata filter
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do you have blue spring shoes?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search_store_names&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata_filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;footwear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AND season=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spring-2026&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
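&lt;p&gt;Filter strings get fiddly to assemble by hand once conditions come from user input. A small helper of our own that emits the &lt;code&gt;key="value" AND ...&lt;/code&gt; form used in the query above (string values quoted, numeric values left bare):&lt;/p&gt;

```python
def build_metadata_filter(**conditions):
    """Join key=value conditions with AND, matching the filter
    syntax shown in the query above. Strings are double-quoted;
    numbers are emitted unquoted."""
    parts = []
    for key, value in conditions.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            parts.append(f"{key}={value}")
        else:
            parts.append(f'{key}="{value}"')
    return " AND ".join(parts)
```

&lt;p&gt;For example, &lt;code&gt;build_metadata_filter(category="footwear", season="spring-2026")&lt;/code&gt; produces exactly the filter string passed in the snippet above.&lt;/p&gt;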



&lt;h3&gt;
  
  
  Structured Output
&lt;/h3&gt;

&lt;p&gt;Starting with Gemini 3 models, File Search can be combined with structured output. This is perfect for extracting structured data from grounded responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Product name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Brief product description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How confident the match is&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find products similar to a red running shoe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search_store_names&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_mime_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProductMatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
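&lt;p&gt;Because the output is constrained to the schema, &lt;code&gt;response.text&lt;/code&gt; is JSON that matches &lt;code&gt;ProductMatch&lt;/code&gt;; with pydantic, &lt;code&gt;ProductMatch.model_validate_json(response.text)&lt;/code&gt; returns a typed object in one call. A dependency-free sketch of the same parse (the helper name is ours):&lt;/p&gt;

```python
import json


def parse_product_match(response_text):
    """Parse the JSON emitted under the ProductMatch schema.

    Equivalent to ProductMatch.model_validate_json(response_text),
    but using only the stdlib: load the JSON and check that the
    schema's fields are present.
    """
    data = json.loads(response_text)
    missing = {"name", "description", "confidence"} - data.keys()
    if missing:
        raise ValueError(f"response missing fields: {sorted(missing)}")
    return data
```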



&lt;h3&gt;
  
  
  Chunking Configuration
&lt;/h3&gt;

&lt;p&gt;For more control over how documents are split, the chunking strategy can be configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_search_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_to_file_search_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;file_search_store_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_search_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;long_document.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Technical Manual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunking_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;white_space_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens_per_chunk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_overlap_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;p&gt;With multimodal retrieval, File Search opens up scenarios that text-only RAG can't handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visual product search&lt;/strong&gt;: Index catalogs with images and spec sheets, then search by visual similarity or natural language descriptions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research and technical documentation&lt;/strong&gt;: Retrieve specific charts, architecture diagrams, or data visualizations from papers and reports.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insurance and claims processing&lt;/strong&gt;: Combine structured forms with damage photos for unified document and visual assessment.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design systems&lt;/strong&gt;: Make component libraries searchable by visual appearance, not just naming conventions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real estate and property listings&lt;/strong&gt;: Match properties based on floor plans, interior photos, and visual preferences.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;File Search is designed to be cost-effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indexing:&lt;/strong&gt; You pay for embeddings at indexing time (&lt;a href="https://ai.google.dev/gemini-api/docs/pricing#gemini-embedding-2" rel="noopener noreferrer"&gt;embeddings pricing&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Free.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query-time embeddings:&lt;/strong&gt; Free.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieved tokens:&lt;/strong&gt; Charged as regular context tokens.&lt;/li&gt;
&lt;/ul&gt;
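&lt;p&gt;As a rough sketch of how these line items combine, here is a small cost estimator. The per-million-token prices below are placeholders, not real figures; always check the linked pricing page:&lt;/p&gt;

```python
# Placeholder prices for illustration only -- NOT real Gemini API pricing.
EMBED_PRICE_PER_M = 0.15    # $ per 1M tokens embedded at indexing time (assumed)
CONTEXT_PRICE_PER_M = 0.30  # $ per 1M retrieved tokens billed as context (assumed)

def indexing_cost(total_doc_tokens):
    # One-time cost: embeddings are billed when documents are indexed.
    return total_doc_tokens / 1e6 * EMBED_PRICE_PER_M

def monthly_query_cost(retrieved_tokens_per_query, queries_per_month):
    # Storage and query-time embeddings are free; only the retrieved
    # chunks that enter the model context are billed.
    return retrieved_tokens_per_query * queries_per_month / 1e6 * CONTEXT_PRICE_PER_M
```

&lt;p&gt;With these placeholder rates, indexing a 1M-token corpus costs $0.15 once, and 500 queries each retrieving 2,000 tokens cost $0.30.&lt;/p&gt;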

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;Here's everything needed to get started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/gemini-api/docs/file-search" rel="noopener noreferrer"&gt;File Search documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/google-gemini/cookbook/blob/main/quickstarts/File_Search.ipynb" rel="noopener noreferrer"&gt;File Search quickstart notebook&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/googleapis/python-genai" rel="noopener noreferrer"&gt;The latest Python SDK&lt;/a&gt;: Install it with &lt;code&gt;pip install -U google-genai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aistudio.google.com/apikey" rel="noopener noreferrer"&gt;Get an API key&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create your store with &lt;code&gt;gemini-embedding-2&lt;/code&gt;, upload some images, and start building multimodal RAG applications.&lt;/p&gt;
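&lt;p&gt;As a minimal sketch of that flow with the &lt;code&gt;google-genai&lt;/code&gt; SDK (the store name, file name, and model name below are placeholders; the dict-form tool entry mirrors the typed config shown earlier in this post):&lt;/p&gt;

```python
def file_search_tool(store_names):
    # Dict form of the File Search tool entry passed to generate_content.
    return {"file_search": {"file_search_store_names": list(store_names)}}

def run_demo():
    # Requires GEMINI_API_KEY in the environment; makes network calls.
    from google import genai  # pip install -U google-genai

    client = genai.Client()

    # 1. Create a store and index a document (placeholder file name).
    #    In production, poll the returned operation until indexing completes.
    store = client.file_search_stores.create(
        config={"display_name": "demo-store"}
    )
    client.file_search_stores.upload_to_file_search_store(
        file_search_store_name=store.name,
        file="catalog_page.pdf",
    )

    # 2. Query with the File Search tool attached.
    response = client.models.generate_content(
        model="gemini-flash-latest",  # substitute the model you use
        contents="Which products are waterproof?",
        config={"tools": [file_search_tool([store.name])]},
    )
    return response.text
```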

</description>
      <category>api</category>
      <category>gemini</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Fine-Tuning Gemma 4 with Cloud Run Jobs: Serverless GPUs (NVIDIA RTX 6000 Pro) for pet breed classification 🐈🐕</title>
      <dc:creator>Shir Meir Lador</dc:creator>
      <pubDate>Tue, 28 Apr 2026 19:54:21 +0000</pubDate>
      <link>https://dev.to/googleai/fine-tuning-gemma-4-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-45ib</link>
      <guid>https://dev.to/googleai/fine-tuning-gemma-4-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-45ib</guid>
      <description>&lt;p&gt;Google has just announced &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;the release of &lt;strong&gt;Gemma 4&lt;/strong&gt;&lt;/a&gt;! This new generation of open models brings significant advancements, particularly in reasoning capabilities and architectural efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bridging Reasoning and Precision with Gemma 4
&lt;/h2&gt;

&lt;p&gt;In my previous blog, I demonstrated how to &lt;a href="https://dev.to/googleai/fine-tuning-gemma-3-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-248b"&gt;fine-tune Gemma 3 27B on &lt;strong&gt;Cloud Run Jobs&lt;/strong&gt; using &lt;strong&gt;NVIDIA RTX PRO 6000 Blackwell Edition GPUs&lt;/strong&gt; for pet breed classification&lt;/a&gt;. With the release of Gemma 4, I couldn't wait to update my pipeline and see how the new model performs.&lt;/p&gt;

&lt;p&gt;In this follow-up post, I'll explain what makes Gemma 4 different, the benefits it brings, and exactly what file modifications and workarounds are needed to successfully fine-tune it using PEFT (LoRA) on Cloud Run. We'll cover everything from memory requirements and dynamic label masking to prompt structures for reasoning models. Whether you read the previous post or are new to this pipeline, this guide will provide a complete, working solution for Gemma 4.&lt;/p&gt;

&lt;p&gt;If you'd rather dive straight into the code and explore it at your own pace, you can clone the repository &lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/finetune_gemma" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in Gemma 4?
&lt;/h2&gt;

&lt;p&gt;Gemma 4 introduces groundbreaking improvements over Gemma 3, making it Google's most intelligent open model family to date:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache 2.0 License&lt;/strong&gt;: Gemma 4 is released under a commercially permissive Apache 2.0 license, providing full developer flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Highly Competitive Benchmarks&lt;/strong&gt;: The 31B model ranks as the #3 open model on the Arena AI text leaderboard, while the 26B MoE model ranks #6, outcompeting models 20x their size!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Reasoning &amp;amp; Agents&lt;/strong&gt;: Purpose-built for multi-step planning and deep logic. It features native support for function-calling, structured JSON output, and native system instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal &amp;amp; Long Context&lt;/strong&gt;: Natively processes images, video, and even audio (in edge models). It supports up to a 256K context window for larger models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versatile Architectures&lt;/strong&gt;: Includes a 26B Mixture of Experts (MoE) model that only activates 3.8B parameters during inference for fast response times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of these changes, simply dropping Gemma 4 into a Gemma 3 fine-tuning script won't work out of the box. Here is a breakdown of what needed to change in the codebase to make it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Memory and Parameter Capacity
&lt;/h2&gt;

&lt;p&gt;With the availability of &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads" rel="noopener noreferrer"&gt;&lt;strong&gt;NVIDIA RTX PRO 6000&lt;/strong&gt; GPUs&lt;/a&gt; on &lt;a href="https://docs.cloud.google.com/run/docs" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;, we now have access to &lt;strong&gt;96GB of VRAM&lt;/strong&gt;. This is a game-changer for hosting and fine-tuning large models.&lt;/p&gt;

&lt;p&gt;According to the formula discussed in my blog post on &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models/" rel="noopener noreferrer"&gt;Decoding high-bandwidth memory&lt;/a&gt;: &lt;em&gt;Total HBM ≈ (Model Size) + (Optimizer States) + (Gradients) + (Activations)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When using &lt;strong&gt;LoRA&lt;/strong&gt; (Low-Rank Adaptation), we freeze the base model weights and only train a small subset of parameters. This means the memory-hungry gradients and optimizer states are negligible for the base model. For &lt;strong&gt;Gemma 4 31B&lt;/strong&gt; loaded in 16-bit precision (bfloat16), the base model size is roughly &lt;em&gt;31 billion parameters × 2 bytes/parameter ≈ 62 GB.&lt;/em&gt; While this 62GB model fits comfortably within the &lt;strong&gt;96GB of VRAM&lt;/strong&gt; available on the RTX 6000 Pro, we can do even better!&lt;/p&gt;

&lt;p&gt;By applying &lt;strong&gt;4-bit quantization (QLoRA)&lt;/strong&gt; via the bitsandbytes library, we dramatically shrink this base memory footprint to roughly 18–20GB. This leaves an enormous amount of VRAM overhead exclusively dedicated to the high-memory activations required by multi-modal processing and long-context training batches, unlocking unparalleled serverless efficiency!&lt;/p&gt;
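&lt;p&gt;The back-of-envelope arithmetic above is easy to reproduce. This sketch counts weight bytes only; activations, LoRA optimizer states, and 4-bit quantization block metadata come on top, which is why real 4-bit usage lands closer to 18–20GB than the raw 15.5GB:&lt;/p&gt;

```python
BYTES_PER_GB = 1e9  # decimal gigabytes, matching the figures in the text

def weight_footprint_gb(num_params, bytes_per_param):
    # Weight-only footprint of the frozen base model. Gradients and
    # optimizer states for the LoRA adapters are negligible by comparison.
    return num_params * bytes_per_param / BYTES_PER_GB

PARAMS = 31e9  # Gemma 4 31B

bf16_gb = weight_footprint_gb(PARAMS, 2.0)  # bfloat16: 2 bytes/param, about 62 GB
nf4_gb = weight_footprint_gb(PARAMS, 0.5)   # 4-bit NF4: 0.5 bytes/param, about 15.5 GB
```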

&lt;h2&gt;
  
  
  Key Code Changes for Gemma 4 Migration
&lt;/h2&gt;

&lt;p&gt;If you are updating your own script or starting fresh, these are the critical adjustments made to the pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multimodal Input Ordering &amp;amp; Integrated Instructions
&lt;/h3&gt;

&lt;p&gt;While Gemma 4 supports interleaved inputs and a &lt;a href="https://ai.google.dev/gemma/docs/core/model_card_4#mixture-of-experts_moe_model" rel="noopener noreferrer"&gt;native system role&lt;/a&gt;, we recommend providing the image data before the text as a stable convention and merging instructions into the user prompt for this pipeline. We found this 'single-turn' structure more effective for maintaining instruction-following precision and simplifying our custom masking logic.&lt;/p&gt;

&lt;p&gt;In the code below, the &lt;code&gt;{"type": "image"}&lt;/code&gt; entry acts as a placeholder that signals the processor to inject special image tokens into the chat template. The actual image tensors are then passed separately during the data collation step to ensure the multimodal architecture is adapted correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;full_user_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Identify the breed of the animal in this image.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# Image must come first!
&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_user_content&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caption&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Loading the Correct Multimodal Architecture
&lt;/h3&gt;

&lt;p&gt;Gemma 4 natively processes images, video, and even audio (in the E2B and E4B models), which changes how the model must be loaded. To correctly handle these diverse inputs, we explicitly use the &lt;code&gt;AutoModelForMultimodalLM&lt;/code&gt; class. While &lt;code&gt;AutoModelForImageTextToText&lt;/code&gt; remains a valid option for purely image-based tasks, the multimodal class is the more precise choice for the Gemma 4 architecture, ensuring it is ready to handle video and audio data natively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForMultimodalLM&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForMultimodalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;model_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Label Masking for Multimodal Data
&lt;/h3&gt;

&lt;p&gt;In Gemma 3, we could hardcode specific token IDs to find where the assistant's response started to mask the prompt. For Gemma 4, we initially tried tokenizing the text prompt separately to find its length, but hit a major snag.&lt;/p&gt;

&lt;p&gt;Gemma 4 is highly efficient with media: each image gets a &lt;a href="https://ai.google.dev/gemma/docs/capabilities/vision#variable-resolution" rel="noopener noreferrer"&gt;dynamic number of soft tokens&lt;/a&gt; exactly fitted to its content. While these image soft tokens are highly stable and pre-computable (their count does not change whether the image is alone or accompanied by text), standard tokenizers can still introduce slight boundary quirks when concatenating text and control tokens &lt;em&gt;after&lt;/em&gt; these media tokens. If you tokenize the prompt in isolation, the length might be slightly off compared to the fully assembled chat template, tanking the model's accuracy.&lt;/p&gt;

&lt;p&gt;To achieve the highest precision, we implemented a bulletproof backward-search collator. Instead of trying to calculate the prompt length, we search the full &lt;code&gt;input_ids&lt;/code&gt; array for the exact tokens of our breed name label. Once found, we step backwards to locate the &lt;code&gt;&amp;lt;|turn&amp;gt;&lt;/code&gt; control token that marks the start of the assistant's response, and mask everything before it. This mathematically guarantees the model is trained exactly on the required template structure and the label, without any masking misalignment.&lt;/p&gt;
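&lt;p&gt;A simplified sketch of that backward-search idea (the token IDs and the assistant-turn control token ID are made up for illustration; the real collator operates on the processor's output tensors):&lt;/p&gt;

```python
IGNORE_INDEX = -100  # value the Hugging Face loss functions skip by default

def find_sublist(haystack, needle):
    # Return the start index of the last occurrence of needle, or -1.
    n = len(needle)
    for start in range(len(haystack) - n, -1, -1):
        if haystack[start:start + n] == needle:
            return start
    return -1

def mask_prompt(input_ids, label_ids, turn_token_id):
    # 1. Locate the label tokens (the breed name) in the full sequence.
    label_start = find_sublist(input_ids, label_ids)
    if label_start == -1:
        # Label not found: mask everything so the example contributes no loss.
        return [IGNORE_INDEX] * len(input_ids)
    # 2. Step backwards from the label to the control token that opens
    #    the assistant turn.
    turn_pos = label_start
    while turn_pos and input_ids[turn_pos] != turn_token_id:
        turn_pos -= 1
    # 3. Mask every position before the assistant turn; keep the rest.
    labels = list(input_ids)
    for i in range(turn_pos):
        labels[i] = IGNORE_INDEX
    return labels
```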

&lt;h3&gt;
  
  
  4. Bypassing Custom Layers &amp;amp; Unlocking the Vision Tower
&lt;/h3&gt;

&lt;p&gt;This was the most critical breakthrough! The official Hugging Face implementation for Gemma 4 uses a custom neural network wrapper called &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; for its projection layers. This custom class wraps a standard &lt;code&gt;nn.Linear&lt;/code&gt; layer but adds specific logic to clip minimum and maximum activations (&lt;code&gt;input_min&lt;/code&gt;, &lt;code&gt;output_max&lt;/code&gt;, etc.) to stabilize training.&lt;/p&gt;

&lt;p&gt;When we tried to apply standard LoRA by targeting specific layer names like &lt;code&gt;q_proj&lt;/code&gt; or &lt;code&gt;v_proj&lt;/code&gt;, we hit two major issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Activation Clipping Bypass&lt;/strong&gt;: Standard PEFT/LoRA doesn't natively recognize &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt;. If forced to attach to the inner &lt;code&gt;.linear&lt;/code&gt; weights, it bypasses the parent wrapper entirely. Without that crucial activation clipping during the forward pass, the model's activations become unstable, and the training loss explodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frozen Vision Tower&lt;/strong&gt;: Even if we fixed the text backbone, standard text-focused LoRA configurations often miss the vision tower's projection layers, leaving the model's "eyes" frozen during training.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution is to use the macro &lt;code&gt;target_modules="all-linear"&lt;/code&gt;. This tells the PEFT library to recursively scan the entire model tree. It safely identifies and wraps nested linear layers without breaking the outer &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; clipping logic. Crucially, it also ensures that every linear layer across &lt;strong&gt;both the language model and the vision tower&lt;/strong&gt; is adapted to your data, without sacrificing architectural stability.&lt;/p&gt;
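&lt;p&gt;Conceptually, the &lt;code&gt;"all-linear"&lt;/code&gt; macro performs a recursive walk over the module tree. This toy sketch (the layer names and nesting are illustrative, not the real Gemma 4 layout) shows how such a scan reaches linear layers nested inside wrapper modules and inside the vision tower, which name-based targeting would miss:&lt;/p&gt;

```python
# Toy module tree standing in for the model; "clippable" plays the role
# of a wrapper like Gemma4ClippableLinear with an inner linear layer.
model_tree = {
    "language_model": {
        "q_proj": {"_type": "clippable", "linear": {"_type": "linear"}},
        "v_proj": {"_type": "clippable", "linear": {"_type": "linear"}},
        "mlp_up": {"_type": "linear"},
    },
    "vision_tower": {
        "patch_proj": {"_type": "linear"},
        "out_proj": {"_type": "clippable", "linear": {"_type": "linear"}},
    },
}

def all_linear(tree, prefix=""):
    # Recursively collect the dotted name of every linear leaf, the way
    # target_modules="all-linear" scans the whole model tree. Wrappers are
    # left intact; only their inner linear layers are selected.
    found = []
    for name, child in tree.items():
        if name == "_type":
            continue
        path = prefix + "." + name if prefix else name
        if child.get("_type") == "linear":
            found.append(path)
        found.extend(all_linear(child, path))
    return found
```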

&lt;h3&gt;
  
  
  5. Results
&lt;/h3&gt;

&lt;p&gt;By combining the multimodal architecture, bulletproof masking, and full-tower LoRA, we achieved a nice improvement in the model accuracy.&lt;/p&gt;

&lt;p&gt;Note that Gemma 4's baseline performance &lt;strong&gt;(89% accuracy) was significantly higher&lt;/strong&gt; than &lt;a href="https://dev.to/googleai/fine-tuning-gemma-3-with-cloud-run-jobs-serverless-gpus-nvidia-rtx-6000-pro-for-pet-breed-248b"&gt;Gemma 3's baseline performance &lt;strong&gt;(67% accuracy)&lt;/strong&gt;&lt;/a&gt;, so the headroom for improvement was smaller; the accuracy gain here is more modest, but still meaningful.&lt;/p&gt;

&lt;h4&gt;
  
  
  Intermediate Results (700 Samples, ~50 minutes Run)
&lt;/h4&gt;

&lt;p&gt;Even with a small subset of 700 training images, we saw a nice boost over the baseline in less than one hour:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frv5idqnzvq6inavt8ddo.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frv5idqnzvq6inavt8ddo.webp" alt=" " width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Results on 700 training samples and 200 evaluation samples&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h4&gt;
  
  
  Final Results (Full Dataset, ~4.25 Hours Run)
&lt;/h4&gt;

&lt;p&gt;Running the full &lt;a href="https://www.robots.ox.ac.uk/~vgg/data/pets/" rel="noopener noreferrer"&gt;Oxford-IIIT Pet dataset&lt;/a&gt; (~4,000 training images and 3,669 evaluation images) yielded our peak performance (SOTA for this dataset is 94% accuracy):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowa1irxg2lzaqix2586k.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowa1irxg2lzaqix2586k.webp" alt=" " width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;small&gt;Results on 4000 training samples and 3669 evaluation samples&lt;/small&gt;&lt;/center&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;In this run, we utilized a more aggressive LoRA configuration than typical text-only runs: a &lt;strong&gt;Rank 64&lt;/strong&gt; / &lt;strong&gt;Alpha 64&lt;/strong&gt; setup with a &lt;strong&gt;5e-5 learning rate&lt;/strong&gt;. This gave the model enough "surface area" to refine its visual features for the specific nuances of the pet dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Managing VRAM with QLoRA &amp;amp; Gradient Checkpointing
&lt;/h3&gt;

&lt;p&gt;While 96GB of VRAM on the RTX 6000 Pro is massive, training a 31B parameter model with LoRA still pushes the boundaries of a single GPU. To ensure absolute stability and prevent Out-Of-Memory (OOM) errors during the backward pass, our script implements a two-pronged optimization strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QLoRA (4-bit Quantization):&lt;/strong&gt; Utilizing &lt;code&gt;BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")&lt;/code&gt; to drastically reduce the model's footprint when loaded on CUDA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Checkpointing:&lt;/strong&gt; Specifically enabled for the 31B model, this trades a slight increase in compute time for a significant reduction in VRAM usage by recalculating activations instead of storing them all in memory.&lt;/li&gt;
&lt;/ul&gt;
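&lt;p&gt;Put together, the loading configuration looks roughly like this (a sketch assuming the &lt;code&gt;transformers&lt;/code&gt;, &lt;code&gt;bitsandbytes&lt;/code&gt;, and &lt;code&gt;torch&lt;/code&gt; packages; the compute dtype is our choice here, not a Gemma 4 requirement):&lt;/p&gt;

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization shrinks the 31B base weights to roughly 18-20GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
)

model_kwargs = {
    "quantization_config": bnb_config,
    "torch_dtype": torch.bfloat16,
    "device_map": "auto",
}

# After loading the model, enable gradient checkpointing to trade a
# little extra compute for a large reduction in activation memory:
#   model.gradient_checkpointing_enable()
```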

&lt;h2&gt;
  
  
  The Complete Fine-Tuning Workflow on Cloud Run
&lt;/h2&gt;

&lt;p&gt;Before you begin the fine-tuning process, ensure you have the following software and environment configurations in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Project&lt;/strong&gt; with billing enabled and APIs active (Cloud Run, Artifact Registry, Cloud Build, Secret Manager).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA RTX PRO 6000&lt;/strong&gt; availability in your region (e.g., &lt;code&gt;europe-west4&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hugging Face Token&lt;/strong&gt;: A valid token with access to the Gemma 4 model weights.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 0: Set Environment Variables
&lt;/h2&gt;

&lt;p&gt;Set the following environment variables to align with the steps below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;YOUR_PROJECT_ID]
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;europe-west4
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;YOUR_HF_TOKEN]
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"finetune-gemma-job-sa"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="nt"&gt;-gemma4-finetuning-eu&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AR_REPO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma4-finetuning-repo
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SECRET_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;HF_TOKEN
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IMAGE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma4-finetune
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JOB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gemma4-finetuning-job
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1: Get the Code
&lt;/h2&gt;

&lt;p&gt;Whether you're running locally or on the cloud, you'll need the code. Clone the repository and navigate to the project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/GoogleCloudPlatform/devrel-demos
&lt;span class="nb"&gt;cd &lt;/span&gt;devrel-demos/ai-ml/finetune_gemma/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Test Locally Before Cloud Deployment
&lt;/h2&gt;

&lt;p&gt;Before spinning up massive GPUs in the cloud, it is best practice to verify your pipeline locally using a smaller model variant (like the E2B IT model) on a subset of the data.&lt;/p&gt;

&lt;p&gt;To run a local CPU test, first activate your virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, execute the script with a very small dataset to ensure the pipeline completes successfully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 finetune_and_evaluate.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-id&lt;/span&gt; google/gemma-4-e2b-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt; cpu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--train-size&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--eval-size&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gradient-accumulation-steps&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-epochs&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you verify that the training pipeline completes successfully, you are ready to scale up to Cloud Run!&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Stage the Model in GCS
&lt;/h2&gt;

&lt;p&gt;To save startup time and avoid repetitive downloads from the internet during training, stage the model weights (e.g., &lt;code&gt;google/gemma-4-31b-it&lt;/code&gt;) in a GCS bucket located in the same region as your Cloud Run job. We provide a utility script within the repository to perform this transfer directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Navigate to the utility directory&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;hf-to-gcs
&lt;span class="c"&gt;# Execute the transfer script&lt;/span&gt;
python3 hf_to_gcs.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-id&lt;/span&gt; google/gemma-4-31b-it &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hf-token&lt;/span&gt; &lt;span class="nv"&gt;$HF_TOKEN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script ensures that the weights are stored in your project's bucket, enabling high-speed access via volume mounts when the Cloud Run job executes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Build the Container
&lt;/h2&gt;

&lt;p&gt;Use Cloud Build to package your script and dependencies into a container image compatible with CUDA 12.8:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud builds submit &lt;span class="nt"&gt;--tag&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="nt"&gt;-docker&lt;/span&gt;.pkg.dev/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/&lt;span class="nv"&gt;$AR_REPO&lt;/span&gt;/&lt;span class="nv"&gt;$IMAGE_NAME&lt;/span&gt;:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Tip: You can track the real-time progress of your build in the Cloud Build console.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Create and Execute the Cloud Run Job
&lt;/h2&gt;

&lt;p&gt;Create the job with GPU support and volume mounts for the GCS bucket holding the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud beta run &lt;span class="nb"&gt;jobs &lt;/span&gt;create gemma4-finetuning-job &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; gcr.io/&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;/gemma4-finetune &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-type&lt;/span&gt; nvidia-rtx-pro-6000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt; 30.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 120Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--labels&lt;/span&gt; dev-tutorial&lt;span class="o"&gt;=&lt;/span&gt;finetune-gemma &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model-volume,type&lt;span class="o"&gt;=&lt;/span&gt;cloud-storage,bucket&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-volume-mount&lt;/span&gt; &lt;span class="nv"&gt;volume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model-volume,mount-path&lt;span class="o"&gt;=&lt;/span&gt;/mnt/gcs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"--model-id"&lt;/span&gt;,&lt;span class="s2"&gt;"/mnt/gcs/google/gemma-4-31b-it/"&lt;/span&gt;,&lt;span class="s2"&gt;"--output-dir"&lt;/span&gt;,&lt;span class="s2"&gt;"/mnt/gcs/gemma4-finetuned"&lt;/span&gt;,&lt;span class="s2"&gt;"--train-size"&lt;/span&gt;,&lt;span class="s2"&gt;"700"&lt;/span&gt;,&lt;span class="s2"&gt;"--eval-size"&lt;/span&gt;,&lt;span class="s2"&gt;"200"&lt;/span&gt;,&lt;span class="s2"&gt;"--merge"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then execute it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud beta run &lt;span class="nb"&gt;jobs &lt;/span&gt;execute gemma4-finetuning-job &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="nt"&gt;--async&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
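Because the job was started with &lt;code&gt;--async&lt;/code&gt;, the command returns immediately. A sketch of how you might check on the run afterwards with standard gcloud commands (the execution name is a placeholder you would copy from the list output):

```shell
# List executions of the fine-tuning job
gcloud beta run jobs executions list \
  --job gemma4-finetuning-job \
  --region "$REGION"

# Inspect one execution in detail (replace EXECUTION_NAME with a real name)
gcloud beta run jobs executions describe EXECUTION_NAME \
  --region "$REGION"
```

These commands require an authenticated gcloud environment with the job already created, so treat them as a workflow fragment rather than something to run verbatim.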



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Migrating to Gemma 4 requires handling its new architecture and response formats, but the effort pays off with its superior reasoning and adherence to instructions. By leveraging &lt;a href="https://docs.cloud.google.com/run/docs/create-jobs" rel="noopener noreferrer"&gt;Cloud Run Jobs&lt;/a&gt; and &lt;a href="https://cloud.google.com/blog/products/serverless/cloud-run-supports-nvidia-rtx-6000-pro-gpus-for-ai-workloads" rel="noopener noreferrer"&gt;Serverless Blackwell GPUs&lt;/a&gt;, you can train these massive models efficiently without managing servers.&lt;/p&gt;

&lt;p&gt;To get started with inference, explore this codelab: &lt;a href="https://codelabs.developers.google.com/codelabs/cloud-run/cloud-run-gpu-rtx-pro-6000-gemma4-vllm#0" rel="noopener noreferrer"&gt;Run inference of Gemma 4 model on Cloud Run with RTX 6000 Pro GPU with vLLM&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about production serving, refer to the &lt;a href="https://docs.cloud.google.com/run/docs/run-gemma-on-cloud-run" rel="noopener noreferrer"&gt;Cloud Run Gemma 4 documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy fine-tuning! 🎉&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Special thanks to Ryan Mullins, Juyeong Ji and Gus Martins from the Gemma 4 team for the helpful review and feedback on this blog.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemma</category>
      <category>machinelearning</category>
      <category>googlecloud</category>
      <category>cloudrun</category>
    </item>
    <item>
      <title>How I used Gemini CLI to orchestrate a complex RAG migration</title>
      <dc:creator>Remigiusz Samborski</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:02:03 +0000</pubDate>
      <link>https://dev.to/googleai/how-i-used-gemini-cli-to-orchestrate-a-complex-rag-migration-43ga</link>
      <guid>https://dev.to/googleai/how-i-used-gemini-cli-to-orchestrate-a-complex-rag-migration-43ga</guid>
      <description>&lt;p&gt;Building a complex, multi-phase cloud project like a RAG migration is as much about orchestration as it is about code. You have to manage infrastructure (&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;), backend services (&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Python&lt;/a&gt;), frontend UI (&lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;), data pipelines (&lt;a href="https://docs.cloud.google.com/bigquery/docs/introduction?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;BigQuery&lt;/a&gt;/&lt;a href="https://docs.cloud.google.com/alloydb/docs/overview?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;AlloyDB&lt;/a&gt;), and documentation - all while maintaining a consistent technical strategy.&lt;/p&gt;

&lt;p&gt;Standard IDE completions are great for snippets, but they lack the system-level context needed for this kind of engineering. To build this reference architecture, I didn't just use an AI to write code. I used an AI to &lt;strong&gt;orchestrate the entire project&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this final post (see previous &lt;a href="https://medium.com/@rsamborski/6ead93ca4aec" rel="noopener noreferrer"&gt;part 1&lt;/a&gt; and &lt;a href="https://medium.com/@rsamborski/8a0464af6f55" rel="noopener noreferrer"&gt;part 2&lt;/a&gt;), I'll share a behind-the-scenes look at using &lt;a href="https://geminicli.com/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt; with the &lt;a href="https://github.com/gemini-cli-extensions/conductor" rel="noopener noreferrer"&gt;Conductor extension&lt;/a&gt; to orchestrate this migration.&lt;/p&gt;

&lt;p&gt;In this post, you will learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to leverage terminal-first AI assistants for system-level engineering
&lt;/li&gt;
&lt;li&gt;How to implement spec-driven development with &lt;a href="https://github.com/gemini-cli-extensions/conductor" rel="noopener noreferrer"&gt;the Conductor extension&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;How to use AI-driven Test-Driven Development (TDD) for reliable code generation
&lt;/li&gt;
&lt;li&gt;How to collaborate with AI agents using the "Human-in-the-Loop" model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we dive into the workflow, let's briefly discuss why orchestration is the next logical step for AI-assisted development.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Developer Experience&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's walk through my development process step-by-step. The entire specification, plan, and implementation logic is available in &lt;a href="https://github.com/rsamborski/rag-migration/tree/main/conductor" rel="noopener noreferrer"&gt;the &lt;code&gt;conductor&lt;/code&gt; directory&lt;/a&gt; of the &lt;code&gt;rag-migration&lt;/code&gt; repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spec-driven development with Conductor
&lt;/h3&gt;

&lt;p&gt;Central to my workflow is the Conductor extension. It's built on the principle of &lt;strong&gt;spec-driven development&lt;/strong&gt;. Instead of jumping straight into code, we define the "source of truth" in Markdown files.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product Definition (&lt;code&gt;product.md&lt;/code&gt;):&lt;/strong&gt; What are we building?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tech Stack (&lt;code&gt;tech-stack.md&lt;/code&gt;):&lt;/strong&gt; What tools are we using?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracks Registry (&lt;code&gt;tracks.md&lt;/code&gt;):&lt;/strong&gt; What are the major milestones?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation Plans (&lt;code&gt;plan.md&lt;/code&gt; for each of the tracks):&lt;/strong&gt; What are the step-by-step tasks?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow (&lt;code&gt;workflow.md&lt;/code&gt;):&lt;/strong&gt; How are we building the solution?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By having these documents in the codebase, the AI agent (Gemini CLI) always has the high-level context it needs to make smart decisions. It's also a good practice to share those with your team so everyone (including AI agents) is on the same page about the project's direction.&lt;/p&gt;
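Concretely, the resulting layout looks roughly like this (assumed from the files named above and the repository links later in the post; your tree may differ):

```
conductor/
├── product.md       # product definition
├── tech-stack.md    # tools and frameworks
├── tracks.md        # registry of major milestones
├── workflow.md      # how we build (TDD, checkpoints)
└── archive/
    └── initial-embeddings_20260303/
        ├── spec.md
        └── plan.md
```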

&lt;h3&gt;
  
  
  Conductor initialization
&lt;/h3&gt;

&lt;p&gt;The first step of project initialization is to create the product definition and tech stack files. This is handled by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/conductor:setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini CLI will ask you a series of questions to help you define your project, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the name of your product?
&lt;/li&gt;
&lt;li&gt;Who are the primary users?
&lt;/li&gt;
&lt;li&gt;What is the tech stack you are using?
&lt;/li&gt;
&lt;li&gt;What are the major features you want to implement?
&lt;/li&gt;
&lt;li&gt;What is the workflow you want to use?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will then create the initial project structure in the &lt;code&gt;conductor&lt;/code&gt; directory, including the &lt;a href="https://github.com/rsamborski/rag-migration/blob/main/conductor/product.md" rel="noopener noreferrer"&gt;product.md&lt;/a&gt; and &lt;a href="https://github.com/rsamborski/rag-migration/blob/main/conductor/tech-stack.md" rel="noopener noreferrer"&gt;tech-stack.md&lt;/a&gt; files.&lt;/p&gt;

&lt;h3&gt;
  
  
  The lifecycle of a track
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fat6kt8jvqw6o4oha2npr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fat6kt8jvqw6o4oha2npr.png" alt="Gemini CLI with conductor - lifecycle" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each major feature in this project was implemented as a "Track". A typical track lifecycle consists of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Track Initialization (&lt;code&gt;/conductor:newTrack&lt;/code&gt;):

&lt;ul&gt;
&lt;li&gt;The agent creates a &lt;code&gt;spec.md&lt;/code&gt; file that describes the goals of the track
&lt;/li&gt;
&lt;li&gt;The agent maps the existing codebase and validates assumptions
&lt;/li&gt;
&lt;li&gt;The agent creates a &lt;code&gt;plan.md&lt;/code&gt; file that describes the steps needed to achieve the goals
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Track Execution (&lt;code&gt;/conductor:implement&lt;/code&gt;):

&lt;ul&gt;
&lt;li&gt;The agent iterates through tasks using a &lt;strong&gt;Plan -&amp;gt; Act -&amp;gt; Validate&lt;/strong&gt; cycle
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Track Completion:

&lt;ul&gt;
&lt;li&gt;The agent verifies the changes made during the track
&lt;/li&gt;
&lt;li&gt;The agent asks for user feedback on the implementation
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Track Archival:

&lt;ul&gt;
&lt;li&gt;Once a track is completed, Gemini CLI archives the track in the &lt;code&gt;conductor/archive&lt;/code&gt; directory&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, when I started the initial embeddings track, I initialized it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/conductor:newTrack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini CLI researches the codebase, asks clarifying questions, and creates the &lt;a href="https://github.com/rsamborski/rag-migration/blob/main/conductor/archive/initial-embeddings_20260303/spec.md" rel="noopener noreferrer"&gt;spec.md&lt;/a&gt; and &lt;a href="https://github.com/rsamborski/rag-migration/blob/main/conductor/archive/initial-embeddings_20260303/plan.md" rel="noopener noreferrer"&gt;plan.md&lt;/a&gt; files. Only after I review and approve them does the actual implementation start.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraform for Infrastructure as Code
&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://github.com/rsamborski/rag-migration/blob/b17a233a67886bf574601e9d72d53416c6ffcf90/conductor/product.md?plain=1#L21" rel="noopener noreferrer"&gt;product.md&lt;/a&gt; file instructs Gemini CLI to write Terraform code for all the resources created during the project. This works really well: all resources are consistently managed as source code, and it's easy to spin up a new environment when needed.&lt;/p&gt;

&lt;p&gt;You can see all the Terraform files and infrastructure scripts used in the first track in the &lt;a href="https://github.com/rsamborski/rag-migration/tree/main/01-generation/infra" rel="noopener noreferrer"&gt;infra&lt;/a&gt; directory.&lt;/p&gt;

&lt;p&gt;Moreover, over the course of the project I instructed Gemini CLI to always run &lt;code&gt;terraform plan&lt;/code&gt; before &lt;code&gt;terraform apply&lt;/code&gt;. Keeping this rule in the &lt;a href="https://github.com/rsamborski/rag-migration/blob/b17a233a67886bf574601e9d72d53416c6ffcf90/conductor/workflow.md?plain=1#L12" rel="noopener noreferrer"&gt;workflow.md&lt;/a&gt; file ensures it is applied to all tracks.&lt;/p&gt;
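The plan-before-apply discipline maps onto a short command sequence; a sketch (the directory path is taken from the repository linked above, and the plan file name is illustrative):

```shell
cd 01-generation/infra        # the track's Terraform directory
terraform init                # install providers, set up state
terraform plan -out=tfplan    # review the proposed changes first
terraform apply tfplan        # apply exactly the plan that was reviewed
```

Saving the plan with `-out` and applying that file guarantees that what gets applied is what was reviewed; these commands need a configured Terraform workspace and cloud credentials to actually run.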

&lt;h3&gt;
  
  
  TDD with an AI agent
&lt;/h3&gt;

&lt;p&gt;One of the most powerful aspects of this workflow is AI-driven Test-Driven Development (TDD). I didn't just ask the agent to "write the code". It followed a strict protocol:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write Failing Tests:&lt;/strong&gt; The agent defines the expected behavior in a new test file
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Phase:&lt;/strong&gt; It runs the tests and confirms they fail
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Green Phase:&lt;/strong&gt; It writes the minimum code needed to pass the tests
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactor:&lt;/strong&gt; It refactors the implementation code and the test code to improve clarity, remove duplication, and enhance performance without changing the external behavior.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify Coverage:&lt;/strong&gt; It verifies that the test coverage meets the project requirements (target: &amp;gt;80% coverage for new code).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commit Code Changes:&lt;/strong&gt; The agent commits code changes related to the task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that the AI-generated code isn't just "syntactically correct" but functionally verified against my requirements. This workflow is described in the &lt;a href="https://github.com/rsamborski/rag-migration/blob/b17a233a67886bf574601e9d72d53416c6ffcf90/conductor/workflow.md?plain=1#L18" rel="noopener noreferrer"&gt;workflow.md&lt;/a&gt; file.&lt;/p&gt;
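From the terminal, the red/green/coverage loop above might look like the following (file and package names are hypothetical, and pytest with the pytest-cov plugin is assumed, since the exact test runner isn't named in the post):

```shell
pytest tests/test_embeddings.py        # Red: the newly written tests should fail
# ...the agent writes the minimum implementation...
pytest tests/test_embeddings.py        # Green: the same tests now pass
pytest --cov=src --cov-fail-under=80   # Coverage gate: fail if coverage drops below 80%
```

The `--cov-fail-under` flag makes the coverage target an enforced gate rather than a suggestion, which is what lets an agent self-verify the step.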

&lt;h3&gt;
  
  
  Checkpoints and quality gates
&lt;/h3&gt;

&lt;p&gt;At the end of every phase, Gemini CLI runs a "Checkpoint" protocol. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated Verification:&lt;/strong&gt; Running the full test suite.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Verification:&lt;/strong&gt; Providing the user with step-by-step instructions to verify the changes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditable Records:&lt;/strong&gt; Attaching a verification report to the git commit using &lt;code&gt;git notes&lt;/code&gt; and updating &lt;code&gt;plan.md&lt;/code&gt; with the new commit hash.&lt;/li&gt;
&lt;/ul&gt;
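The &lt;code&gt;git notes&lt;/code&gt; mechanism is standard Git: it attaches metadata to a commit without rewriting history. A minimal, self-contained sketch (the repository and note text are illustrative):

```shell
# Create a throwaway repo so the example stands alone
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "Phase 1 checkpoint"

# Attach a verification report to the commit without changing its hash
git notes add -m "verification: full test suite passed"

# The note is stored under refs/notes and shown alongside the commit
git notes show HEAD
```

Because notes live in a separate ref, they can be added after the fact by an agent (or a human) without invalidating any commit hashes recorded in `plan.md`.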

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kk90njuhpzsjwell1ph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kk90njuhpzsjwell1ph.png" alt="Conductor commits demonstrating the checkpoint protocol" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Conductor commits demonstrating the checkpoint protocol.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Effective Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;To achieve effective human-AI development synergy, I relied heavily on the following solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini CLI in a sandbox with YOLO mode enabled - see &lt;a href="https://dev.to/googleai/secure-gemini-cli-for-cloud-development-2mpe"&gt;my past article&lt;/a&gt; for more about it.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/rsamborski/vibecoding/blob/main/gemini-cli-sandbox/sandbox_notifier.sh" rel="noopener noreferrer"&gt;Custom sandbox notifier script&lt;/a&gt; that runs in another terminal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach provided safe guardrails and let me work on other projects while the AI worked on this one; timely notifications meant I could always jump back quickly. Moreover, Conductor's checkpointing mechanism meant I could always revert unwanted changes or restart from a known working state.&lt;/p&gt;

&lt;p&gt;I also used &lt;a href="https://antigravity.google/?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt; to polish the generated code and the documentation. It was particularly helpful for minor tweaks or refactoring of the code that was generated by Gemini CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token usage
&lt;/h3&gt;

&lt;p&gt;Throughout the project I used several models (Gemini 3 Pro, Gemini 3 Flash and Gemini 2.5 Flash Lite). The total token consumption was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input tokens: ~19M
&lt;/li&gt;
&lt;li&gt;Cached input tokens: ~66M
&lt;/li&gt;
&lt;li&gt;Output tokens: ~400k&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice the high number of cached input tokens, which significantly reduces the spend. The total Vertex AI token cost was &lt;strong&gt;around $30&lt;/strong&gt;. Not bad for several days of AI-assisted work.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/pricing?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt; for more details, and keep in mind that your mileage may vary.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Software engineering is evolving from writing code to orchestrating agentic workflows. By using tools like Gemini CLI and frameworks like Conductor, you can scale your impact as an architect while ensuring consistent, high-quality implementation.&lt;/p&gt;

&lt;p&gt;Ready to build your own AI-assisted development projects?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://geminicli.com/" rel="noopener noreferrer"&gt;Check out Gemini CLI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/gemini-cli-extensions/conductor" rel="noopener noreferrer"&gt;Explore the Conductor extension&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://antigravity.google/?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Try Antigravity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rsamborski/rag-migration" rel="noopener noreferrer"&gt;Check out the full RAG Migration repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Thanks for reading&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you found this article helpful, please consider adding 50 claps to this post by pressing and holding the clap button 👏 This will help others find it. You can also share it with your friends on socials.&lt;/p&gt;

&lt;p&gt;I'm always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on &lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/RemikSamborski" rel="noopener noreferrer"&gt;X&lt;/a&gt; or &lt;a href="https://bsky.app/profile/rsamborski.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>agents</category>
      <category>programming</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
