<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karl Weinmeister</title>
    <description>The latest articles on DEV Community by Karl Weinmeister (@kweinmeister).</description>
    <link>https://dev.to/kweinmeister</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2926299%2F7c094fc1-b557-4030-b220-fd4fc43ed1bd.jpeg</url>
      <title>DEV Community: Karl Weinmeister</title>
      <link>https://dev.to/kweinmeister</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kweinmeister"/>
    <language>en</language>
    <item>
      <title>How to Use the Gemini Deep Research API in Production</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Wed, 04 Mar 2026 16:08:05 +0000</pubDate>
      <link>https://dev.to/googleai/how-to-use-the-gemini-deep-research-api-in-production-3cif</link>
      <guid>https://dev.to/googleai/how-to-use-the-gemini-deep-research-api-in-production-3cif</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgvykoqjjzof8mwmrtxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgvykoqjjzof8mwmrtxc.png" alt="Cover image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How many of us have gone down the research rabbit hole, opening way too many tabs and collecting links and notes in the pursuit of knowledge? It’s all useful material, but gathering it is time-consuming and distracting.&lt;/p&gt;

&lt;p&gt;Since I discovered the &lt;a href="https://ai.google.dev/gemini-api/docs/deep-research?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini Deep Research Agent&lt;/a&gt;, I haven’t looked back. Best of all, it has a powerful, straightforward API for kicking off research programmatically. Let’s explore how to use it, and the patterns for integrating it into a production architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Async changes everything
&lt;/h3&gt;

&lt;p&gt;A single research task can trigger dozens of search queries and take several minutes to complete. The asynchronous &lt;a href="https://ai.google.dev/gemini-api/docs/interactions?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Interactions API&lt;/a&gt; handles this with a polling-based interface: you launch the task with the required &lt;code&gt;background=True&lt;/code&gt; parameter, then check on its progress.&lt;/p&gt;

&lt;p&gt;If you’ve ever worked with a &lt;a href="https://cloud.google.com/pubsub/docs/overview?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt; pipeline or job queue, this will feel familiar.&lt;/p&gt;

&lt;h3&gt;
  
  
  Meet the Interactions API
&lt;/h3&gt;

&lt;p&gt;The Interactions API is a newer, unified interface for working with Gemini models and agents. It replaces the older &lt;code&gt;generateContent&lt;/code&gt; pattern for scenarios that need state management, tool orchestration, or background execution.&lt;/p&gt;

&lt;p&gt;You create an interaction, point it at the deep research agent, and tell it to run in the background:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Launch the research agent in the background
&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research the history and future of Solid State Batteries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deep-research-pro-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That call returns immediately with an interaction ID. The agent is now off doing its thing, autonomously planning search queries, reading pages, and iterating on its analysis. Your application is free to do whatever it needs to do in the meantime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Polling for results
&lt;/h3&gt;

&lt;p&gt;Now you need a way to check whether the agent has finished. The &lt;code&gt;status&lt;/code&gt; field tells you everything you need to know:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# The full research report is ready
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Still working. Check again in 10 seconds.
&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Taking it to production with Cloud Run
&lt;/h3&gt;

&lt;p&gt;In a notebook, a &lt;code&gt;while True&lt;/code&gt; loop gets the job done. In production, you want something that scales, recovers from failures, and doesn’t burn compute waiting. Google Cloud offers three Cloud Run compute models, each mapping to a different integration pattern with the Deep Research agent.&lt;/p&gt;
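
&lt;p&gt;Before reaching for infrastructure, the polling loop itself can be hardened. Here’s a sketch of a reusable poller with exponential backoff and a timeout (the helper and its defaults are my own; &lt;code&gt;fetch&lt;/code&gt; would wrap &lt;code&gt;client.interactions.get&lt;/code&gt;):&lt;/p&gt;

```python
import time

def poll_with_backoff(fetch, initial=5.0, factor=1.5, cap=60.0,
                      timeout=1800.0, sleep=time.sleep):
    """Poll fetch() until a terminal status, backing off between checks."""
    delay, waited = initial, 0.0
    while waited < timeout:
        result = fetch()
        if result.status in ("completed", "failed"):
            return result
        sleep(delay)
        waited += delay
        delay = min(delay * factor, cap)  # grow the interval, but cap it
    raise TimeoutError("research did not finish within the timeout")

# Usage: poll_with_backoff(lambda: client.interactions.get(interaction.id))
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; keeps the helper testable; in a real service you would also persist the interaction ID so a crash doesn’t lose the task.&lt;/p&gt;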

&lt;h3&gt;
  
  
  Cloud Run service: webhook-triggered research
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://cloud.google.com/run/docs/overview/what-is-cloud-run?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run service&lt;/a&gt; works when you want to trigger research from an HTTP request. The service accepts the request, kicks off the agent, stores the interaction ID, and returns immediately. A separate mechanism (a &lt;a href="https://cloud.google.com/scheduler/docs/overview?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt; cron, a &lt;a href="https://cloud.google.com/workflows/docs/overview?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Workflow&lt;/a&gt;, or a callback) handles checking the results later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearchRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_research&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ResearchRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deep-research-pro-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Store the ID for later retrieval (e.g., in Firestore or Cloud SQL)
&lt;/span&gt;    &lt;span class="nf"&gt;save_interaction_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interaction_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
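
&lt;p&gt;The read side can be a small companion endpoint. Here is a sketch of the lookup logic (the route path and payload shape are illustrative, not part of the API):&lt;/p&gt;

```python
def summarize_interaction(interaction):
    """Map an Interactions API object to a small JSON-safe payload."""
    if interaction.status == "completed":
        return {"status": "completed", "report": interaction.outputs[-1].text}
    if interaction.status == "failed":
        return {"status": "failed", "error": str(interaction.error)}
    return {"status": interaction.status}  # e.g. still in progress

# Wiring inside the service above:
# @app.get("/research/{interaction_id}")
# async def get_research(interaction_id: str):
#     return summarize_interaction(client.interactions.get(interaction_id))
```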



&lt;h3&gt;
  
  
  Cloud Run job: batch research tasks
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://cloud.google.com/run/docs/overview/what-is-cloud-run?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run job&lt;/a&gt; is a natural fit for one-shot or scheduled research. Jobs execute code and stop, which maps cleanly to “launch, poll, write, exit.” If you have a batch of research topics, you can fan them out as parallel job tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_research_job&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RESEARCH_TOPIC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default research topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deep-research-pro-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Poll until done
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Write the report to Cloud Storage and exit
&lt;/span&gt;            &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-research-reports&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;run_research_job&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
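
&lt;p&gt;For the fan-out case, Cloud Run jobs set &lt;code&gt;CLOUD_RUN_TASK_INDEX&lt;/code&gt; on each parallel task, so every task can select its own topic. A sketch (the topic list is illustrative):&lt;/p&gt;

```python
import os

# Illustrative batch of topics; each parallel task handles one of them.
TOPICS = [
    "History and future of solid state batteries",
    "Grid-scale energy storage economics",
    "Sodium-ion battery alternatives",
]

def topic_for_task(topics, task_index=None):
    """Pick this task's topic from CLOUD_RUN_TASK_INDEX (set by Cloud Run jobs)."""
    if task_index is None:
        task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    return topics[task_index % len(topics)]
```

&lt;p&gt;Deploying the job with as many tasks as topics then runs the research in parallel; &lt;code&gt;run_research_job&lt;/code&gt; would call &lt;code&gt;topic_for_task(TOPICS)&lt;/code&gt; instead of reading a single &lt;code&gt;RESEARCH_TOPIC&lt;/code&gt; variable.&lt;/p&gt;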



&lt;h3&gt;
  
  
  Cloud Run worker pool: continuous research dispatcher
&lt;/h3&gt;

&lt;p&gt;The most interesting option for a production pipeline is a &lt;a href="https://cloud.google.com/run/docs/deploy-worker-pools?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run worker pool&lt;/a&gt;. Worker pools are designed for continuous, non-HTTP, pull-based background processing. They don’t need a public endpoint, they don’t autoscale by default (you bring your own logic), and they cost &lt;a href="https://cloud.google.com/blog/products/serverless/exploring-cloud-run-worker-pools-and-kafka-autoscaler?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;up to 40% less&lt;/a&gt; than instance-billed services.&lt;/p&gt;

&lt;p&gt;If you’re building a system that continuously pulls research requests from a &lt;a href="https://cloud.google.com/pubsub/docs/overview?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Pub/Sub&lt;/a&gt; subscription, dispatches them to the agent, and writes completed reports to storage, a worker pool is purpose-built for that pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pubsub_v1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;subscriber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pubsub_v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SubscriberClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;subscription_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/my-project/subscriptions/research-requests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deep-research-pro-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Poll until done, then write results
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-research-reports&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nack&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Retry later
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pull messages continuously (worker pool stays alive)
&lt;/span&gt;&lt;span class="n"&gt;streaming_pull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subscriber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscription_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;handle_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;streaming_pull&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Grounding with your own data
&lt;/h3&gt;

&lt;p&gt;Web research is powerful, but sometimes you need the agent to work with private data or internal documents. The Deep Research agent supports a file search tool for exactly this. Think of it as RAG, but orchestrated automatically by the agent rather than wired up manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compare our 2025 fiscal year report against current public web news.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deep-research-pro-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_search_store_names&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FILE_SEARCH_STORE_NAME&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the architecture gets interesting for enterprise use cases. The agent can combine internet research with grounded analysis of your internal documents, all within a single research task.&lt;/p&gt;
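&lt;p&gt;One production detail: because the interaction above was created with &lt;code&gt;background=True&lt;/code&gt;, the call returns before the research finishes, so your service needs to wait for a terminal state. The generic poll-with-backoff helper below sketches that pattern; the &lt;code&gt;check&lt;/code&gt; callable is a placeholder for whatever status call your SDK exposes, not a specific Gemini API method.&lt;/p&gt;

```python
import time

def poll_until_done(check, interval=5.0, timeout=3600.0, backoff=1.5):
    """Call check() repeatedly until it returns a truthy result or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()  # e.g. return the interaction once its status is terminal
        if result:
            return result
        time.sleep(interval)
        interval = min(interval * backoff, 60.0)  # exponential backoff, capped
    raise TimeoutError("research task did not finish before the timeout")
```

&lt;p&gt;For long-running tasks, a push mechanism such as Pub/Sub is usually preferable to polling from a request handler, which ties up your service while it waits.&lt;/p&gt;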

&lt;h3&gt;
  
  
  Stateful follow-ups
&lt;/h3&gt;

&lt;p&gt;After a research task completes, you can ask follow-up questions that reference the original research context without re-running the entire workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;follow_up&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can you elaborate on the key findings?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;previous_interaction_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;follow_up&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;

&lt;p&gt;This &lt;a href="https://colab.research.google.com/github/kweinmeister/notebooks/blob/master/deep_research.ipynb" rel="noopener noreferrer"&gt;Deep Research notebook&lt;/a&gt; walks you through the entire flow, from setting up the client to launching research tasks. For pricing details, check the &lt;a href="https://ai.google.dev/gemini-api/docs/pricing#pricing-for-agents?utm_campaign=CDR_0x2b6f3004_default_b488870862&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini API pricing page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ready to stop Googling and start delegating? Grab the notebook and run your first deep research task. I’d love to hear what you build with it. Come find me on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister" rel="noopener noreferrer"&gt;X&lt;/a&gt;, or &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; and share what research tasks you’re automating.&lt;/p&gt;




</description>
      <category>googlecloudrun</category>
      <category>deepresearch</category>
      <category>pubsub</category>
      <category>asynchronousprogramming</category>
    </item>
    <item>
      <title>Skills Made Easy with Google Antigravity and Gemini CLI</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Thu, 26 Feb 2026 16:52:06 +0000</pubDate>
      <link>https://dev.to/googleai/skills-made-easy-with-google-antigravity-and-gemini-cli-4chb</link>
      <guid>https://dev.to/googleai/skills-made-easy-with-google-antigravity-and-gemini-cli-4chb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jkxr5hhcd0zeogdlxdf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jkxr5hhcd0zeogdlxdf.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you ask an AI assistant a question, you have two choices: hope its training is current, or burn through tokens reading documentation. What if you could give your agent the right answer, right away?&lt;/p&gt;

&lt;p&gt;That’s the power of &lt;strong&gt;Agent Skills&lt;/strong&gt;. Skills are reusable packages of knowledge that extend what your agent can do without overwhelming its context window. Defined with a &lt;code&gt;SKILL.md&lt;/code&gt; file, they allow you to teach your agent how to accomplish tasks consistently. Instead of forcing an agent to process an entire library’s worth of documentation at once, Skills act as &lt;strong&gt;on-demand expertise&lt;/strong&gt;.&lt;/p&gt;
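&lt;p&gt;A skill is just a directory whose &lt;code&gt;SKILL.md&lt;/code&gt; starts with YAML frontmatter (the name and description the agent loads up front), followed by the full instructions it reads only when the skill is invoked. A minimal sketch; the skill name and contents here are hypothetical:&lt;/p&gt;

```markdown
---
name: release-notes
description: Drafts release notes from merged pull requests in the house style.
---

# Release notes

When asked to draft release notes:

1. Group merged changes into Features, Fixes, and Chores.
2. Summarize each change in one sentence, linking its pull request.
3. Lead with user-facing impact, not implementation detail.
```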

&lt;p&gt;You can learn more about the open standard at &lt;a href="https://agentskills.io" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; and discover community capabilities at &lt;a href="https://skills.sh" rel="noopener noreferrer"&gt;skills.sh&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how to manage these skills in the &lt;a href="https://geminicli.com/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, a powerful terminal-native AI assistant, and &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt;, an advanced agentic coding assistant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing skills
&lt;/h3&gt;

&lt;p&gt;Both the Gemini CLI and Antigravity access skills by reading them from standard directories on your local machine. To add new skills, you can drop them into these locations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5adxw6li6ear7x0zuye5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5adxw6li6ear7x0zuye5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing Skills in Gemini CLI
&lt;/h3&gt;

&lt;p&gt;Gemini CLI offers built-in &lt;a href="https://geminicli.com/docs/cli/skills/" rel="noopener noreferrer"&gt;skill management&lt;/a&gt;. You can use either interactive slash commands during a session, or terminal commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnq8aia6riu2ya4dkf8s4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnq8aia6riu2ya4dkf8s4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These commands make it easy to pull in skills from a Git repository or local directory, and manage whether they are active for your current project.&lt;/p&gt;

&lt;p&gt;For example, if you want to install a specific skill located inside a subdirectory of a larger repository (like Firebase’s &lt;code&gt;firebase-ai-logic-basics&lt;/code&gt;), you can use the &lt;code&gt;--path&lt;/code&gt; flag:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkbym3egq6v2d06fugjp.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkbym3egq6v2d06fugjp.gif"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini skills &lt;span class="nb"&gt;install &lt;/span&gt;https://github.com/firebase/agent-skills.git — path skills/firebase-ai-logic-basics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To audit which skills are currently loaded into your agent’s context, you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini skills list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command provides a clear overview of all discovered skills across your workspace and global environments, showing their descriptions and file locations so you know exactly what expertise your agent has access to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified management with the skills tool
&lt;/h3&gt;

&lt;p&gt;While Gemini CLI has robust built-in tools, what if you want to manage skills across &lt;em&gt;both&lt;/em&gt; Gemini CLI and Antigravity simultaneously? Managing them by hand across the different &lt;code&gt;~/.gemini/skills/&lt;/code&gt; and &lt;code&gt;~/.gemini/antigravity/skills/&lt;/code&gt; directories can get tedious.&lt;/p&gt;

&lt;p&gt;That’s where the open-source CLI tool from &lt;a href="https://github.com/vercel-labs/skills" rel="noopener noreferrer"&gt;vercel-labs/skills&lt;/a&gt; shines. It uses a &lt;a href="https://en.wikipedia.org/wiki/Symbolic_link" rel="noopener noreferrer"&gt;symlink&lt;/a&gt; approach to easily install, update, and remove skills centrally, sharing them across multiple agents without duplicating files.&lt;/p&gt;
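&lt;p&gt;The core of the symlink approach can be sketched in a few lines of Python (paths and function names here are illustrative, not the tool’s actual layout): keep one canonical copy of each skill and link it into every agent’s skills directory.&lt;/p&gt;

```python
from pathlib import Path

def link_skill(canonical: Path, agent_dirs: list[Path]) -> None:
    """Expose one canonical skill directory to several agents via symlinks."""
    for agent_dir in agent_dirs:
        agent_dir.mkdir(parents=True, exist_ok=True)
        link = agent_dir / canonical.name
        if not link.exists():
            # One copy on disk; updating it updates every agent at once.
            link.symlink_to(canonical.resolve(), target_is_directory=True)
```

&lt;p&gt;Removing a skill then means deleting the symlinks and the single canonical copy, with no risk of stale duplicates drifting out of sync between agents.&lt;/p&gt;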

&lt;h3&gt;
  
  
  Getting Started with skills
&lt;/h3&gt;

&lt;p&gt;The easiest way to begin with the unified CLI is the &lt;code&gt;add&lt;/code&gt; command. You can pass the &lt;code&gt;-a&lt;/code&gt; or &lt;code&gt;--agent&lt;/code&gt; parameter for each client you’d like to add the skill to.&lt;/p&gt;

&lt;p&gt;For example, suppose you want to equip your agent with deep knowledge of Firebase to help build full-stack apps. You could run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add firebase/agent-skills &lt;span class="nt"&gt;-a&lt;/span&gt; gemini-cli &lt;span class="nt"&gt;-a&lt;/span&gt; antigravity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxdcra24h6btiue19tbg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxdcra24h6btiue19tbg.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚠️ Note that the skill will be added to the Gemini CLI even without the &lt;code&gt;-a&lt;/code&gt; parameter, as it supports the default &lt;code&gt;~/.agents/skills&lt;/code&gt; global directory. The extra parameter is provided here for clarity, to show both clients in one command.&lt;/p&gt;

&lt;p&gt;This installs the skill and instantly makes it available to both Gemini and Antigravity. By adding firebase/agent-skills, your agents can reliably build and deploy apps with Firebase Auth, Firestore, and more. For more details on how this skill works, read &lt;a href="https://firebase.blog/posts/2026/02/ai-agent-skills-for-firebase" rel="noopener noreferrer"&gt;Introducing Agent Skills for Firebase&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re looking for skills related to a specific technology, you can search for them directly from your terminal. For instance, if you’re building a mobile app, you might want to find capabilities related to Flutter. You can use the &lt;code&gt;find&lt;/code&gt; command to discover relevant skills:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills find flutter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zeqtayx9pmwubkb2boy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zeqtayx9pmwubkb2boy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This command searches the community skills registry and returns a list of matching capabilities, displaying the most popular ones first alongside their installation commands. You can quickly copy those commands to add the expertise directly to your active agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keeping your agent’s context clean
&lt;/h3&gt;

&lt;p&gt;It’s easy to get excited and install dozens of skills. While progressive disclosure means your agent isn’t reading the &lt;em&gt;entire&lt;/em&gt; instruction manual for every skill on every prompt, simply loading the names, descriptions, and metadata of 50 different skills can still clutter the initial context window, leading to confusion or degraded performance.&lt;/p&gt;

&lt;p&gt;To keep your agents focused and efficient, make sure to keep your essential skills up-to-date with your chosen tool’s update commands. More importantly, if you find you aren’t using a skill anymore, take a moment to disable or remove it (e.g., &lt;code&gt;/skills disable &amp;lt;name&amp;gt;&lt;/code&gt; in Gemini CLI or &lt;code&gt;npx skills remove &amp;lt;name&amp;gt;&lt;/code&gt;) to free up that precious context space.&lt;/p&gt;

&lt;p&gt;By managing skills in &lt;a href="https://geminicli.com/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt; and &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt; with the &lt;a href="https://skills.sh/docs/cli" rel="noopener noreferrer"&gt;skills CLI&lt;/a&gt;, you can tailor and organize your environment to your liking. To get more hands-on experience building skills, you can try out the &lt;a href="https://codelabs.developers.google.com/gemini-cli/how-to-create-agent-skills-for-gemini-cli#0" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; codelab.&lt;/p&gt;

&lt;p&gt;Have you built any interesting workflows using Agent Skills? I’d love to hear how you’re extending your agents. Share what you’ve built with me on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister" rel="noopener noreferrer"&gt;X&lt;/a&gt;, or &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/UVcMo8iV7LU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




</description>
      <category>gemini</category>
      <category>antigravity</category>
      <category>agenticai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Performance shouldn’t be an afterthought: Hardening the AI-Assisted SDLC</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Mon, 26 Jan 2026 17:31:22 +0000</pubDate>
      <link>https://dev.to/googleai/performance-shouldnt-be-an-afterthought-hardening-the-ai-assisted-sdlc-45c8</link>
      <guid>https://dev.to/googleai/performance-shouldnt-be-an-afterthought-hardening-the-ai-assisted-sdlc-45c8</guid>
      <description>&lt;h3&gt;
  
  
  Performance shouldn’t be an afterthought: Hardening the AI-assisted SDLC
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7xjx9l2ksb7z4kcgb4v.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7xjx9l2ksb7z4kcgb4v.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s amazing how quickly you can now build a working application with AI assistance. It’s even more amazing how easily you can harden your application for production. But that’s a step that’s often left out of the “vibe coding” software development lifecycle, or SDLC. I hope to change that.&lt;/p&gt;

&lt;p&gt;Why does it matter? The impact of high latency is lost users, and the impact of excess memory usage is lost budget.&lt;/p&gt;

&lt;p&gt;Study after study shows that your application’s latency directly &lt;a href="https://arxiv.org/pdf/2101.09086" rel="noopener noreferrer"&gt;correlates with user satisfaction&lt;/a&gt;, a key ingredient for business success. Meanwhile, your application’s memory usage impacts your Cloud infrastructure cost. For example, &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x2b6f3004_default_b478846417&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; offers &lt;a href="https://docs.cloud.google.com/run/docs/configuring/services/memory-limits?utm_campaign=CDR_0x2b6f3004_default_b478846417&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;memory limits&lt;/a&gt; at various tiers ranging from 512 MiB to 32 GiB. Not to mention, if you underprovision memory, your application reliability will suffer.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through steps I recommend that ensure your application is hardened for production. I’ll use &lt;a href="https://antigravity.google" rel="noopener noreferrer"&gt;Google Antigravity&lt;/a&gt; to build an application with &lt;a href="https://github.com/kweinmeister/perplexity-calculator" rel="noopener noreferrer"&gt;sample application code&lt;/a&gt; available on GitHub.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/hEsZt_Gi-UA"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovery and Tool Selection
&lt;/h3&gt;

&lt;p&gt;If you aren’t an expert in the tooling ecosystem for your application’s language, use AI to bridge the gap. Avoid guessing and ask for industry standards. For example, you can ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I need to profile a Python application for both CPU execution time and memory leaks. What are the most modern, low-overhead tools available? I know about cProfile, but are there better options with visualization (like flame graphs)?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What modern stack might your AI assistant suggest? &lt;a href="https://github.com/plasma-umass/scalene" rel="noopener noreferrer"&gt;scalene&lt;/a&gt; is a high-performance profiler whose standout capability is separating time spent in Python versus native code. To dig into memory details, &lt;a href="https://github.com/bloomberg/memray" rel="noopener noreferrer"&gt;memray&lt;/a&gt; can track allocations in native extensions and generate flame graphs that make it easy to spot areas for improvement. Finally, &lt;a href="https://pypi.org/project/pytest-benchmark/" rel="noopener noreferrer"&gt;pytest-benchmark&lt;/a&gt; is a useful plugin that handles warm-up rounds and statistical analysis automatically.&lt;/p&gt;

&lt;p&gt;If you’re writing code in other languages, the same strategy applies. You might discover &lt;a href="https://github.com/google/pprof" rel="noopener noreferrer"&gt;pprof&lt;/a&gt; for Go, &lt;a href="https://github.com/nodejs/clinic.js" rel="noopener noreferrer"&gt;clinic.js&lt;/a&gt; for Node.js, and other useful tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Establish a Baseline
&lt;/h3&gt;

&lt;p&gt;My use case is &lt;a href="https://huggingface.co/docs/transformers/en/perplexity" rel="noopener noreferrer"&gt;calculating the perplexity&lt;/a&gt; of a given text, which is helpful for AI detection and other use cases. The initial implementation used a naïve algorithm that processes one token at a time, a common result when you simply ask for a solution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seq_len&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;current_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_ids_int64&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Construct single-token input
&lt;/span&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="n"&gt;current_token&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;past_key_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Run inference for just this token
&lt;/span&gt;    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Optimize for Speed
&lt;/h3&gt;

&lt;p&gt;While this code works, it’s slow. With our tools selected from the research phase, we can ask our AI agent to benchmark the baseline code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Generate a Python script using pytest-benchmark to benchmark my perplexity function against a baseline. Create a mock dataset to simulate load.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once we have a benchmark, we can then ask our AI agent to optimize it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Profile this baseline code and suggest an optimized routine. Focus on throughput.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A standard engineering strategy to address loop overhead is &lt;a href="https://en.wikipedia.org/wiki/Array_programming" rel="noopener noreferrer"&gt;vectorization&lt;/a&gt;. The revised approach feeds the entire sequence to the model in one go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_perplexity_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Encode entire text at once
&lt;/span&gt;    &lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Single inference call for the whole sequence
&lt;/span&gt;    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Shape: [1, SeqLen, Vocab]
&lt;/span&gt;
    &lt;span class="c1"&gt;# 3. Vectorized loss calculation (No loops)
&lt;/span&gt;    &lt;span class="c1"&gt;# ... numpy vector operations ...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_nll&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
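&lt;p&gt;The elided step 3 is standard: take a numerically stable log-softmax over the logits, gather each target token’s log-probability, and exponentiate the mean negative log-likelihood. A self-contained sketch of that calculation, independent of the ONNX session above:&lt;/p&gt;

```python
import numpy as np

def perplexity_from_logits(logits, input_ids):
    """Vectorized perplexity: logits[i] predicts token input_ids[i + 1]."""
    preds = logits[:-1]                      # [seq_len - 1, vocab] predictions
    targets = np.asarray(input_ids[1:])      # the tokens being predicted
    # Numerically stable log-softmax across the vocabulary axis
    z = preds - preds.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))
```

&lt;p&gt;As a sanity check, uniform logits over a vocabulary of size V should yield a perplexity of exactly V.&lt;/p&gt;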



&lt;p&gt;In my test environment, this change led to an overall &lt;strong&gt;2.5x&lt;/strong&gt; speed improvement over the naïve loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimize Memory Usage
&lt;/h3&gt;

&lt;p&gt;Unfortunately, this speed came at a cost. By loading all logits for the entire sequence into memory at once, I created an unbounded memory situation. Long documents would cause peak memory usage to spike uncontrollably. I had solved for latency, but in doing so, I had broken cost constraints.&lt;/p&gt;
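&lt;p&gt;Quick arithmetic shows why this matters (the sequence length and vocabulary size below are illustrative, not from the sample app): a dense float32 logits tensor of shape &lt;code&gt;[seq_len, vocab]&lt;/code&gt; grows linearly with document length.&lt;/p&gt;

```python
def logits_bytes(seq_len, vocab_size, bytes_per_float=4):
    """Memory footprint of a dense [seq_len, vocab] float logits array."""
    return seq_len * vocab_size * bytes_per_float

# A 4,096-token document against a 50,000-token vocabulary:
mib = logits_bytes(4096, 50_000) / (1024 ** 2)  # 781.25 MiB for one request
```

&lt;p&gt;At several hundred MiB per long request, even a few concurrent requests would blow past the lower Cloud Run memory tiers.&lt;/p&gt;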

&lt;p&gt;How could I prompt Antigravity to help?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Analyze my optimized perplexity routine. The target environment is Google Cloud Run with a strict 2GB memory limit. Identify the peak memory usage and refactor the code to stay under this limit without reverting to the slow loop.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The solution balanced speed and memory, processing data in batches large enough to achieve high throughput but small enough to manage peak memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
&lt;span class="n"&gt;logits_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;append_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;logits_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_logits&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Process this chunk
&lt;/span&gt;        &lt;span class="nf"&gt;_process_logits_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Free memory immediately to clip the peak
&lt;/span&gt;        &lt;span class="n"&gt;logits_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
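&lt;p&gt;The same clipping idea in a self-contained form (function and variable names are illustrative, not from the repo): accumulate running totals chunk by chunk so peak memory is bounded by the chunk size rather than the sequence length. Note that any final partial chunk must also be flushed after the loop, a step the excerpt above doesn’t show.&lt;/p&gt;

```python
def chunked_mean(stream, chunk_size=128):
    """Mean of a stream of per-token values, buffering at most chunk_size at once."""
    total, count, buf = 0.0, 0, []
    for value in stream:
        buf.append(value)
        if len(buf) >= chunk_size:
            total += sum(buf)
            count += len(buf)
            buf = []  # free the chunk immediately to clip the peak
    if buf:  # flush the final partial chunk
        total += sum(buf)
        count += len(buf)
    return total / count
```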



&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Before unleashing this process across your codebase, let’s be clear that performance engineering is a rigorous discipline that goes beyond optimizing functions. Industry veteran Brendan Gregg famously warns against the &lt;a href="https://www.brendangregg.com/methodology.html" rel="noopener noreferrer"&gt;Streetlight Anti-Method&lt;/a&gt;: looking for performance problems where it’s easiest, rather than where the problems actually exist.&lt;/p&gt;

&lt;p&gt;Providing your AI assistant the broader context of your application is key, and it’s easy to overlook important details in your prompting. An AI assistant doesn’t know that your production workload is 10 million rows, not the 100 rows in your test script. It can’t see that your database is missing an index or that your network bandwidth is saturated. Most importantly, an AI assistant doesn’t know your intent. If you steer it towards speeding up a query, it will focus on what you asked for, but it likely won’t ask why that data isn’t cached in the first place.&lt;/p&gt;

&lt;p&gt;With those considerations in mind, using AI as a final check is a low-risk, high-reward step. It takes minutes and often catches low-hanging fruit that is overlooked. Then, the next step is maintaining your application’s performance. Consider leveraging tools for &lt;a href="https://cloud.google.com/discover/what-is-application-monitoring?utm_campaign=CDR_0x2b6f3004_default_b478846417&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;continuous application monitoring&lt;/a&gt; to identify regressions and ensure reliability in a live environment.&lt;/p&gt;

&lt;p&gt;I’d love to hear how you’re innovating with your software development lifecycle. Connect with me on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister" rel="noopener noreferrer"&gt;X&lt;/a&gt;, or &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>memorymanagement</category>
      <category>googleantigravity</category>
      <category>performance</category>
      <category>ai</category>
    </item>
    <item>
      <title>AI Agent Engineering in Go with the Google ADK</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Tue, 20 Jan 2026 16:41:44 +0000</pubDate>
      <link>https://dev.to/googleai/ai-agent-engineering-in-go-with-the-google-adk-534o</link>
      <guid>https://dev.to/googleai/ai-agent-engineering-in-go-with-the-google-adk-534o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9arbbpsbkm70dpah0ziw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9arbbpsbkm70dpah0ziw.jpeg" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While Python &lt;a href="https://survey.stackoverflow.co/2025/technology#most-popular-technologies-language-prof-ai" rel="noopener noreferrer"&gt;remains popular&lt;/a&gt; for model training and research, the requirements for &lt;em&gt;serving&lt;/em&gt; and &lt;em&gt;orchestrating&lt;/em&gt; AI agents align closely with Go’s strengths: low latency, high concurrency, and type safety.&lt;/p&gt;

&lt;p&gt;Transitioning from a prototype to a production agent introduces engineering challenges that &lt;a href="https://go.dev/doc/install" rel="noopener noreferrer"&gt;Go&lt;/a&gt; can handle exceptionally well. Go’s static typing catches whole classes of errors at compile time that would otherwise surface while parsing structured LLM outputs. Its &lt;a href="https://go.dev/tour/concurrency/1" rel="noopener noreferrer"&gt;lightweight goroutines&lt;/a&gt;, which start with just a &lt;a href="https://dev.to/jones_charles_ad50858dbc0/in-depth-go-concurrency-a-practical-guide-to-goroutine-performance-nee"&gt;few kilobytes&lt;/a&gt; of stack memory, allow agents to handle thousands of concurrent tool executions without the overhead of heavy thread management.&lt;/p&gt;

&lt;p&gt;In recent years, Go’s adoption for cloud-native microservices has surged: it ranked &lt;a href="https://devecosystem-2025.jetbrains.com/tools-and-trends" rel="noopener noreferrer"&gt;fourth-highest in language promise&lt;/a&gt; and maintained a &lt;a href="https://go.dev/blog/survey2024-h2-results" rel="noopener noreferrer"&gt;93% satisfaction rate&lt;/a&gt; among its users. Google’s &lt;a href="https://google.github.io/adk-docs/get-started/go/" rel="noopener noreferrer"&gt;Agent Development Kit&lt;/a&gt;, or ADK, bridges the gap between these architectural advantages and generative AI.&lt;/p&gt;

&lt;p&gt;In this guide, I’ll walk through scaffolding a new project and deploying it as a secure microservice on Google Cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started with the Agent Starter Pack
&lt;/h3&gt;

&lt;p&gt;The good news is you don’t need to start from scratch. The &lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener noreferrer"&gt;&lt;strong&gt;Agent Starter Pack&lt;/strong&gt;&lt;/a&gt; is a CLI tool that scaffolds a production-ready folder structure, including CI/CD pipelines, infrastructure configuration, and boilerplate code.&lt;/p&gt;

&lt;p&gt;To get started, just run the create command with &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;uvx&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;uvx agent-starter-pack create&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The CLI guides you through an interactive setup. For this project, I selected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project Name:&lt;/strong&gt; my-first-go-agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template:&lt;/strong&gt; Option &lt;strong&gt;6&lt;/strong&gt; (Go ADK, Simple ReAct agent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD:&lt;/strong&gt; Option &lt;strong&gt;3&lt;/strong&gt; (GitHub Actions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region:&lt;/strong&gt; us-central1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkampj5s8no4veqyslwtp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkampj5s8no4veqyslwtp.png" width="753" height="373"&gt;&lt;/a&gt;&lt;/p&gt;
Agent Starter Pack CLI



&lt;p&gt;The tool automatically authenticates with Google Cloud, enables the necessary Vertex AI APIs, and configures your local environment. Once you see the green &lt;strong&gt;Success!&lt;/strong&gt; message, you’re good to go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web User Interface
&lt;/h3&gt;

&lt;p&gt;One of the most convenient features of the ADK is the ability to visually debug your agent before deploying it. By running the command &lt;code&gt;make install &amp;amp;&amp;amp; make playground&lt;/code&gt;, you launch a local development server with a built-in UI. Yes, it has a chat window, but it goes way beyond that by tracing events, tool calls, and more.&lt;/p&gt;

&lt;p&gt;In the screenshot below, I’m interacting with the newly created agent. The agent is configured with a &lt;a href="https://ai.google.dev/gemini-api/docs/langgraph-example?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;ReAct&lt;/a&gt; (Reasoning and Acting) pattern — a framework &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;introduced by Yao et al. in 2022&lt;/a&gt; that has become foundational in agentic AI. The ReAct pattern’s continuous loop of “Thought,” “Action,” and “Observation” enhances problem-solving and interpretability, making the agent’s decision-making process transparent. In this example, the agent recognized the intent, invoked the &lt;code&gt;get_weather&lt;/code&gt; tool, and returned the structured data (“It’s sunny and 72°F”).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo0ok9ne047inpvdg60h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo0ok9ne047inpvdg60h.png" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;
Agent Development Kit web user interface



&lt;h3&gt;
  
  
  Understanding the Code
&lt;/h3&gt;

&lt;p&gt;Now that we’ve seen the agent in action, let’s look at the Go code that makes this work. The logic lives in &lt;code&gt;agent/agent.go&lt;/code&gt;. This file handles tool definitions, model configuration, and initialization.&lt;/p&gt;

&lt;p&gt;The ADK uses standard Go structs to define how the Large Language Model (LLM) interacts with your code. For example, to define the input parameters for our weather tool, we simply define a struct with &lt;code&gt;json&lt;/code&gt; and &lt;code&gt;jsonschema&lt;/code&gt; tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;GetWeatherArgs&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;City&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"city" jsonschema:"City name to get weather for"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GetWeatherResult&lt;/code&gt; defines the structure of the data returned to the agent after the tool executes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;GetWeatherResult&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;Weather&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"weather"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GetWeather&lt;/code&gt; is a standard Go function that accepts a &lt;a href="https://pkg.go.dev/google.golang.org/adk/tool#Context" rel="noopener noreferrer"&gt;tool.Context&lt;/a&gt; and the arguments struct, performs the business logic, and returns the result struct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GetWeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="n"&gt;GetWeatherArgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GetWeatherResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GetWeatherResult&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Weather&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"It's sunny and 72°F in "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;City&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;NewRootAgent&lt;/code&gt; function is responsible for assembling and returning the &lt;a href="https://pkg.go.dev/google.golang.org/adk/agent#Agent" rel="noopener noreferrer"&gt;agent.Agent&lt;/a&gt; instance that the application launcher requires. It begins by initializing the model configuration, creating a &lt;code&gt;gemini-2.5-flash&lt;/code&gt; model instance backed by &lt;code&gt;genai.BackendVertexAI&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next, it bridges the gap between Go code and the LLM by wrapping the local &lt;code&gt;GetWeather&lt;/code&gt; function into a &lt;a href="https://pkg.go.dev/google.golang.org/adk/tool/functiontool" rel="noopener noreferrer"&gt;&lt;code&gt;functiontool&lt;/code&gt;&lt;/a&gt;. This step registers the tool with the name &lt;code&gt;get_weather&lt;/code&gt; and provides the necessary description for the model’s context. Finally, it constructs the agent using &lt;a href="https://pkg.go.dev/google.golang.org/adk/agent/llmagent#New" rel="noopener noreferrer"&gt;llmagent.New&lt;/a&gt;, which combines the initialized Gemini model, the system instructions that define the agent’s behavior, and the slice of available tools into a single unit.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;NewRootAgent&lt;/code&gt; function looks like this (with some error handling removed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewRootAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gemini-2.5-flash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Backend&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BackendVertexAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="p"&gt;})&lt;/span&gt;

 &lt;span class="n"&gt;weatherTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;functiontool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functiontool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Get the current weather for a city."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;GetWeather&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="n"&gt;rootAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;llmagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llmagent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"my-first-go-agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"A helpful AI assistant."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Instruction&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"You are a helpful AI assistant designed to provide accurate and useful information."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Tools&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;weatherTool&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
 &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;The project contains both unit tests for internal logic and end-to-end tests for server integration.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;agent/agent_test.go&lt;/code&gt;, a table-driven test calls the &lt;code&gt;GetWeather&lt;/code&gt; function with a suite of test cases and verifies that each output string matches expectations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestGetWeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="c"&gt;// tests struct initialized with "San Francisco" and "New York"&lt;/span&gt;

 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="c"&gt;// Pass nil for tool.Context since GetWeather doesn't use it&lt;/span&gt;
   &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;GetWeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GetWeatherArgs&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;City&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GetWeather() error = %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wantCity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GetWeather() = %v, want city %v in response"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wantCity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The end-to-end tests verify that the agent works correctly when running as a server, specifically that its A2A (Agent-to-Agent) protocol support functions as expected. The E2E tests start a real instance of the server, send HTTP requests to it, and check the responses. Here’s a snippet from &lt;code&gt;e2e/integration/server_e2e_test.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestA2AMessageSend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Short&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Skipping E2E test in short mode"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Start server (local variable to avoid race conditions)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Starting server process"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;serverProcess&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;startServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;stopServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serverProcess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;waitForServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Server failed to start"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Server process started"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run all tests with &lt;code&gt;make test&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;test                      
&lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; ./agent/... ./e2e/...
&lt;span class="o"&gt;===&lt;/span&gt; RUN TestGetWeather
&lt;span class="o"&gt;===&lt;/span&gt; RUN TestGetWeather/San_Francisco
&lt;span class="o"&gt;===&lt;/span&gt; RUN TestGetWeather/New_York
&lt;span class="nt"&gt;---&lt;/span&gt; PASS: TestGetWeather &lt;span class="o"&gt;(&lt;/span&gt;0.00s&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nt"&gt;---&lt;/span&gt; PASS: TestGetWeather/San_Francisco &lt;span class="o"&gt;(&lt;/span&gt;0.00s&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nt"&gt;---&lt;/span&gt; PASS: TestGetWeather/New_York &lt;span class="o"&gt;(&lt;/span&gt;0.00s&lt;span class="o"&gt;)&lt;/span&gt;
PASS
ok my-first-go-agent/agent 0.218s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;make deploy&lt;/code&gt; command automatically builds your application from source using &lt;a href="https://docs.cloud.google.com/docs/buildpacks/overview?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud Buildpacks&lt;/a&gt;, triggered by the &lt;code&gt;--source .&lt;/code&gt; flag. It deploys this image to &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; with several production-optimized flags: &lt;code&gt;--memory "4Gi"&lt;/code&gt; to provide ample RAM for LLM operations, and &lt;code&gt;--no-cpu-throttling&lt;/code&gt; to keep the CPU allocated even between requests. This configuration is particularly valuable for Go applications, where background goroutines can keep working outside of request handling.&lt;/p&gt;
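
&lt;p&gt;For reference, the deploy target boils down to roughly the following &lt;code&gt;gcloud&lt;/code&gt; invocation. This is an approximation assembled from the flags described above, with the service name and region from the earlier setup; check the generated Makefile for the authoritative command.&lt;/p&gt;

```shell
# Approximate equivalent of `make deploy` (abbreviated flag set)
gcloud run deploy my-first-go-agent \
  --source . \
  --region us-central1 \
  --memory "4Gi" \
  --no-cpu-throttling \
  --no-allow-unauthenticated
```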

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qm9xk5x9jrhaute6w07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qm9xk5x9jrhaute6w07.png" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;
`make deploy` builds the container and deploys to Cloud Run



&lt;p&gt;To ensure your agent runs securely, the command applies a strict configuration. It uses &lt;code&gt;--no-allow-unauthenticated&lt;/code&gt; to block all public access by default, requiring Identity and Access Management (IAM) authentication for any requests. It also injects environment variables via &lt;code&gt;--update-env-vars&lt;/code&gt;, including &lt;code&gt;GOOGLE_GENAI_USE_VERTEXAI=True&lt;/code&gt; to route model calls through Vertex AI. After running the command, I have a service URL!&lt;/p&gt;
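
&lt;p&gt;With the service locked down by IAM, a quick smoke test is to call it with your own identity token. The URL below is a placeholder for the one printed by the deploy:&lt;/p&gt;

```shell
# Invoke the IAM-protected Cloud Run service with an identity token.
# Replace SERVICE_URL with the URL that `make deploy` printed.
SERVICE_URL="https://my-first-go-agent-PLACEHOLDER.run.app"
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" "$SERVICE_URL"
```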

&lt;p&gt;If you want to view the deployed web UI, I recommend deploying with &lt;code&gt;make deploy IAP=true&lt;/code&gt;. This will handle the steps to &lt;a href="https://docs.cloud.google.com/iap/docs/enabling-cloud-run?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;enable IAP for Cloud Run&lt;/a&gt;. You will also need to &lt;a href="https://docs.cloud.google.com/run/docs/securing/identity-aware-proxy-cloud-run#manage_user_or_group_access?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;provide access to users&lt;/a&gt; within your organization following the instructions in the documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b68rswjmmkvg9r2qb47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b68rswjmmkvg9r2qb47.png" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;
Adding a principal to IAP with the Google Cloud Console



&lt;p&gt;With IAP enabled, I can now view the web UI or the deployed &lt;a href="https://google.github.io/adk-docs/a2a/quickstart-consuming/#look-out-for-the-required-agent-card-agent-json-of-the-remote-agent" rel="noopener noreferrer"&gt;Agent Card&lt;/a&gt;. This card serves as your agent’s standard interface, allowing it to be dynamically discovered by other agents, orchestrators, or human-facing UIs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wpxkxma9ehimsvjs3op.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wpxkxma9ehimsvjs3op.png" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s next?
&lt;/h3&gt;

&lt;p&gt;To continue your journey building production AI agents in Go:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;&lt;strong&gt;ADK Documentation&lt;/strong&gt;&lt;/a&gt;: Complete guides on advanced patterns, multi-agent orchestration, and memory systems&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener noreferrer"&gt;&lt;strong&gt;Agent Starter Pack&lt;/strong&gt;&lt;/a&gt;: Explore templates, including multi-agent systems and complex architectures&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/run/docs?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;&lt;strong&gt;Cloud Run Documentation&lt;/strong&gt;&lt;/a&gt;: Deep dives on performance optimization, scaling strategies, and security best practices&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://go.dev/blog/pipelines" rel="noopener noreferrer"&gt;&lt;strong&gt;Go Concurrency Patterns&lt;/strong&gt;&lt;/a&gt;: Understanding goroutines and channels will help you build more efficient agent tooling&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/overview?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;&lt;strong&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/a&gt;: For managed agent infrastructure with built-in orchestration and tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you scale from one agent to many, the engineering decisions we’ve discussed here compound in value. Go’s concurrency model and &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x2b6f3004_default_b476693958&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run’s&lt;/a&gt; autoscaling are both necessary ingredients. Share what you’re building with me on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister" rel="noopener noreferrer"&gt;X&lt;/a&gt;, or &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;!&lt;/p&gt;




</description>
      <category>googlecloudrun</category>
      <category>agents</category>
      <category>go</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Six Failures of Text-to-SQL (And How to Fix Them with Agents)</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Tue, 11 Nov 2025 14:23:12 +0000</pubDate>
      <link>https://dev.to/googleai/the-six-failures-of-text-to-sql-and-how-to-fix-them-with-agents-1n0a</link>
      <guid>https://dev.to/googleai/the-six-failures-of-text-to-sql-and-how-to-fix-them-with-agents-1n0a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslsqdjo6qycm1fbnelev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslsqdjo6qycm1fbnelev.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve written countless SQL queries over the years. Unfortunately, like my golf game, I don’t write SQL enough to be a pro at it. Outside of straightforward SELECT statements, I approach SQL queries iteratively. I’ll inspect the tables, draft a query, and hope for the best. If there are any errors, I’ll go through this loop again.&lt;/p&gt;

&lt;p&gt;While AI models are much better than me at SQL, they aren’t perfect. And that loop I described is just as important for automated approaches to be effective. Text-to-SQL is a &lt;a href="https://arxiv.org/html/2410.01066v1" rel="noopener noreferrer"&gt;deceptively difficult problem&lt;/a&gt; with challenges including linguistic ambiguity and rare SQL operations.&lt;/p&gt;

&lt;p&gt;This is where a multi-agent architecture, built with a framework like Google’s Agent Development Kit (&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;ADK&lt;/a&gt;), becomes essential. We can build a “virtual data analyst” by composing a team of specialized agents. A SchemaExtractor can find the right tables, a SqlGenerator can write the draft, and a SqlCorrector can critique and fix it. A SequentialAgent acts as the manager, ensuring the process is followed, every single time.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll walk through the six most common failure points for Text-to-SQL and show how to solve each one by building out our team of agents, moving from a simple script to a full-fledged agentic system. We’ll use the sample project &lt;a href="https://github.com/kweinmeister/text-to-sql-agent" rel="noopener noreferrer"&gt;kweinmeister/text-to-sql-agent&lt;/a&gt; to illustrate these solutions.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/-Vwd_9Lai38"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Agent Order Issues
&lt;/h3&gt;

&lt;p&gt;Here’s the issue with a single &lt;a href="https://google.github.io/adk-docs/agents/llm-agents/" rel="noopener noreferrer"&gt;LlmAgent&lt;/a&gt; that holds all the tools: &lt;em&gt;it&lt;/em&gt; decides the order of operations. It might confidently skip fetching the schema and invent a table name. Or it might try to run a query &lt;em&gt;before&lt;/em&gt; validating it. A single LLM is deciding what to do next, and it can (and will) make mistakes. That’s not a reliable process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: SequentialAgent for Order Control
&lt;/h3&gt;

&lt;p&gt;The ADK gives us “&lt;a href="https://google.github.io/adk-docs/agents/workflow-agents/" rel="noopener noreferrer"&gt;Workflow Agents&lt;/a&gt;” for this. These specialized agents don’t use an LLM for flow control. They’re deterministic.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://google.github.io/adk-docs/agents/workflow-agents/sequential-agents/" rel="noopener noreferrer"&gt;SequentialAgent&lt;/a&gt; is the simplest and most powerful one to start with. It runs its sub-agents in the &lt;em&gt;exact&lt;/em&gt; order you list them. Using a sequential agent also separates the concerns of “what to do” (our specialized agents) from “the order to do it in” (the workflow agent).&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://google.github.io/adk-docs/agents/workflow-agents/sequential-agents/" rel="noopener noreferrer"&gt;SequentialAgent&lt;/a&gt; also acts as a guardrail. It turns our best practices (“always get the schema first,” “always validate before running”) into enforced infrastructure, not just suggestions in a prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/kweinmeister/text-to-sql-agent/blob/main/src/texttosql/agent.py" rel="noopener noreferrer"&gt;Code Example&lt;/a&gt;: Defining the Workflow Manager
&lt;/h3&gt;

&lt;p&gt;Let’s define our root agent. Instead of a single LlmAgent, our root_agent will be a SequentialAgent. We’ll start by importing the &lt;em&gt;specialists&lt;/em&gt; (we’ll build them out in the next sections):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SequentialAgent&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;schema_extractor_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sql_correction_loop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sql_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.callbacks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;capture_user_message&lt;/span&gt;

&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SequentialAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TextToSqlRootAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;before_agent_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;capture_user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;schema_extractor_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sql_generator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sql_correction_loop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem 2: LLM Schema Hallucinations
&lt;/h3&gt;

&lt;p&gt;This is the classic failure mode. The LLM just doesn’t know your schema.&lt;/p&gt;

&lt;p&gt;A common but flawed fix is to dump the &lt;em&gt;entire&lt;/em&gt; database schema into the prompt. This backfires for two reasons. First, huge enterprise schemas won’t even fit in the context window. Second, even if they did, giving the LLM 100 irrelevant tables to find the 2 relevant ones just drowns it in noise and leads to &lt;em&gt;worse&lt;/em&gt; results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Dedicated Schema-Retrieval Tool
&lt;/h3&gt;

&lt;p&gt;The answer is dynamic retrieval. Don’t give the agent a static block of schema; give it a &lt;em&gt;tool&lt;/em&gt; to &lt;em&gt;fetch&lt;/em&gt; schema. This lets the LLM reason about what it needs first, and &lt;em&gt;then&lt;/em&gt; request only that specific information.&lt;/p&gt;

&lt;p&gt;We can build a simple Python function for this. The ADK makes it easy to turn any function into an agent-callable tool with &lt;a href="https://google.github.io/adk-docs/tools/function-tools/" rel="noopener noreferrer"&gt;FunctionTool&lt;/a&gt;. The agent automatically figures out how to use it from its docstring, a best practice you’ll see in projects like &lt;a href="https://medium.com/@gabi.preda/building-agentic-applications-with-googles-adk-a-hands-on-sql-agent-example-8b30d888293f" rel="noopener noreferrer"&gt;gabrielpreda/adk-sql-agent&lt;/a&gt;.&lt;/p&gt;
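&lt;p&gt;To illustrate the pattern, here’s a minimal sketch of a docstring-driven schema function. The table data and the &lt;em&gt;get_table_ddl&lt;/em&gt; name are hypothetical, not part of the sample project, and the FunctionTool wrapping shown in the comment is an assumption about the ADK API:&lt;/p&gt;

```python
# Hypothetical schema-retrieval function: the docstring is what the
# agent reads to decide when and how to call the tool.
SAMPLE_DDL = {
    "users": "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)",
    "orders": "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
}


def get_table_ddl(table_name: str) -> str:
    """Returns the CREATE TABLE statement for a single table.

    Args:
        table_name: Exact name of the table to describe.

    Returns:
        The table's DDL, or an error message if the table is unknown.
    """
    ddl = SAMPLE_DDL.get(table_name)
    return ddl if ddl else f"Error: unknown table '{table_name}'"


# Wrapping it for an LlmAgent would then look roughly like this
# (requires google-adk; shown as an assumption, not project code):
#   from google.adk.tools import FunctionTool
#   schema_tool = FunctionTool(func=get_table_ddl)
```

&lt;p&gt;Because the agent only asks for the tables it reasons it needs, the prompt stays small even when the database has hundreds of tables.&lt;/p&gt;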

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/kweinmeister/text-to-sql-agent/blob/main/src/texttosql/tools.py" rel="noopener noreferrer"&gt;Code Example&lt;/a&gt;: The Schema Tool
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;💡 In the&lt;/em&gt; &lt;a href="https://github.com/kweinmeister/text-to-sql-agent" rel="noopener noreferrer"&gt;&lt;em&gt;kweinmeister/text-to-sql-agent&lt;/em&gt;&lt;/a&gt; &lt;em&gt;project, the functions are not wrapped as&lt;/em&gt; &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/tools-make-an-agent-from-zero-to-assistant-with-adk?utm_campaign=CDR_0x2b6f3004_default_b459252462&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;&lt;em&gt;tools&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, since they are directly called by a deterministic agent. They are provided centrally in a&lt;/em&gt; &lt;em&gt;tools.py file, so that they can be easily leveraged as tools in a future&lt;/em&gt; &lt;a href="https://google.github.io/adk-docs/agents/llm-agents/" rel="noopener noreferrer"&gt;&lt;em&gt;LlmAgent&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DB_URI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.dialects.dialect&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DatabaseDialect&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_schema_into_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DatabaseDialect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Loads the DDL and SQLGlot schema into the state dictionary.
    This function relies on the caching mechanism within the dialect object.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading schema for dialect: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;db_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DB_URI&lt;/span&gt;
    &lt;span class="c1"&gt;# Error handling code omitted
&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading schema from database: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db_uri&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# The dialect object handles its own caching.
&lt;/span&gt;        &lt;span class="c1"&gt;# The first call to get_ddl will trigger the DB query and cache the DDL.
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calling dialect.get_ddl...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema_ddl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ddl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DDL loaded successfully&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# The call to get_sqlglot_schema will use the cached DDL if available,
&lt;/span&gt;        &lt;span class="c1"&gt;# then parse it and cache the result.
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calling dialect.get_sqlglot_schema...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlglot_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_sqlglot_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQLGlot schema loaded successfully&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQLGlot schema keys: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sqlglot_schema&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error extracting schema: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema_ddl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error loading schema: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlglot_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem 3: Query Logic Errors
&lt;/h3&gt;

&lt;p&gt;Even with the right schema, the LLM can still make logical mistakes with complex joins or aggregations. A human analyst would spot the error, critique it (“That join is wrong, you need to use user_id”), and refine it.&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://google.github.io/adk-docs/api-reference/python/google.adk.agents.html#google.adk.agents.SequentialAgent" rel="noopener noreferrer"&gt;SequentialAgent&lt;/a&gt; is too simple for this. It’s a waterfall. It can’t go backwards and iterate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: LoopAgent for Iterative Refinement
&lt;/h3&gt;

&lt;p&gt;The ADK has another workflow agent for this: the &lt;a href="https://google.github.io/adk-docs/agents/workflow-agents/loop-agents/" rel="noopener noreferrer"&gt;LoopAgent&lt;/a&gt;. This agent runs its sub-agents &lt;em&gt;iteratively&lt;/em&gt; until a condition is met. It’s perfect for a “generate-and-critique” pattern.&lt;/p&gt;

&lt;p&gt;We don’t have to replace our SequentialAgent. We can enhance it by &lt;a href="https://medium.com/@shins777/adk-workflow-the-core-logic-of-ai-agent-8ce4be5c1c40" rel="noopener noreferrer"&gt;&lt;strong&gt;nesting workflow agents&lt;/strong&gt;&lt;/a&gt;. We’ll replace the single query generation step inside our SequentialAgent with a new LoopAgent. This loop will contain a team of two specialists:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A Writer Agent:&lt;/strong&gt; An LlmAgent that writes the SQL draft.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Critic Agent:&lt;/strong&gt; A &lt;em&gt;second LlmAgent&lt;/em&gt; with a different prompt, whose only job is to correct the writer’s SQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a powerful way to get LLMs to self-correct, which improves the quality of the final query.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/kweinmeister/text-to-sql-agent/blob/main/src/texttosql/agents.py" rel="noopener noreferrer"&gt;Code Example&lt;/a&gt;: Building a “Generate-and-Critique” Loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sql_generator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_generator_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generates an initial SQL query from a natural language question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_generator_instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;after_model_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clean_sql_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sql_corrector_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_corrector_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Corrects a failed SQL query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_corrector_instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;after_model_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clean_sql_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sql_correction_loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoopAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQLCorrectionLoop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;sql_processor_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sql_corrector_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem 4: Agent Performance and Cost
&lt;/h3&gt;

&lt;p&gt;We’re now using three LLM-powered agents. This is great for quality, but it’s slow and costs money with every API call.&lt;/p&gt;

&lt;p&gt;What about simple, deterministic steps? Things like validating SQL syntax, formatting data, or cleaning up LLM output. Using a powerful LLM for these jobs is like using a sledgehammer to hang a picture. It’s slow, expensive, and surprisingly unreliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Custom Agents for Code-Based Logic
&lt;/h3&gt;

&lt;p&gt;The ADK isn’t just for LLMs. You can create a “&lt;a href="https://google.github.io/adk-docs/agents/custom-agents/" rel="noopener noreferrer"&gt;Custom Agent&lt;/a&gt;” by inheriting from &lt;a href="https://google.github.io/adk-docs/api-reference/python/google-adk.html#google.adk.agents.BaseAgent" rel="noopener noreferrer"&gt;BaseAgent&lt;/a&gt; and implementing the _run_async_impl method.&lt;/p&gt;

&lt;p&gt;This agent has no LLM. It runs pure Python code. It’s fast and 100% deterministic. We’ll create a custom agent for our next problem: validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/kweinmeister/text-to-sql-agent/blob/main/src/texttosql/agents.py" rel="noopener noreferrer"&gt;Code Example&lt;/a&gt;: Building a Non-LLM ValidationAgent
&lt;/h3&gt;

&lt;p&gt;This agent will use the sqlglot library (which we’ll discuss in detail next) and will be a custom BaseAgent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SQLProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Agent that handles the mechanical steps of:
    1. Validating the current SQL.
    2. Executing it ONLY if validation passed.
    3. Escalating to exit the loop on successful execution.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_run_async_impl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InvocationContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Starting SQL processing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem 5: Dangerous Query Execution
&lt;/h3&gt;

&lt;p&gt;This is the big one. You can’t execute LLM-generated code directly against your database. Ever. It’s a massive security and stability risk.&lt;/p&gt;

&lt;p&gt;We need a fast, reliable check for syntax errors. What if the LLM produces a query that’s syntactically invalid? Or for the wrong SQL dialect?&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Non-Destructive Dry Run with sqlglot
&lt;/h3&gt;

&lt;p&gt;This is where our custom SQLProcessor agent shines. We’ll use the &lt;a href="https://github.com/tobymao/sqlglot" rel="noopener noreferrer"&gt;sqlglot&lt;/a&gt; library, a pure-Python SQL parser and transpiler.&lt;/p&gt;

&lt;p&gt;Why sqlglot? It’s fast and local, and it builds a real Abstract Syntax Tree (AST), which is far more reliable than regex-based checks. It’s also dialect-aware, so it can catch syntax errors specific to, say, PostgreSQL.&lt;/p&gt;

&lt;p&gt;We can just wrap sqlglot.parse_one(sql) in a try…except block. If it parses, the syntax is valid. If it raises a ParseError, it’s not. This gives us a fast and cheap validation signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/kweinmeister/text-to-sql-agent/blob/main/src/texttosql/agents.py" rel="noopener noreferrer"&gt;Code Example&lt;/a&gt;: Full ValidationAgent Implementation
&lt;/h3&gt;

&lt;p&gt;Here is the full implementation of the SQLProcessor agent we previewed, now with the sqlglot validation step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InvocationContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Event&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Part&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlglot&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlglot.expressions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SQLProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Agent that handles the mechanical steps of:
    1. Validating the current SQL.
    2. Executing it ONLY if validation passed.
    3. Escalating to exit the loop on successful execution.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_run_async_impl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InvocationContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Starting SQL processing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;
        &lt;span class="n"&gt;dialect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_dialect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;val_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_sql_validation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;invocation_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invocation_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;custom_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;val_result&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_sql_execution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;result_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;invocation_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invocation_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;custom_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execution_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# If execution succeeds, this is the final answer.
&lt;/span&gt;            &lt;span class="c1"&gt;# Escalate to exit the loop and provide the final content.
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exec_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] SQL execution successful. Escalating to exit loop.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;result_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escalate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

                &lt;span class="n"&gt;final_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_sql_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;final_query&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;final_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;final_query&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;result_event&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Skipping execution due to validation failure.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execution_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem 6: Messy LLM Output
&lt;/h3&gt;

&lt;p&gt;One last thing. LLMs are trained to be helpful conversationalists. So when you ask for a SQL query, you often get this:&lt;/p&gt;

&lt;p&gt;“Sure! Here is the SQL query you asked for: SELECT * FROM users;”&lt;/p&gt;

&lt;p&gt;That conversational fluff will break our SqlValidationAgent every single time. We need a way to programmatically clean the LLM’s output &lt;em&gt;before&lt;/em&gt; it’s passed to the next agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Callbacks for Post-Processing
&lt;/h3&gt;

&lt;p&gt;We could add another CustomAgent just to strip the text, but that feels a bit heavy for such a simple task.&lt;/p&gt;

&lt;p&gt;The ADK offers a more elegant solution: &lt;strong&gt;Callbacks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;a href="https://google.github.io/adk-docs/callbacks/types-of-callbacks/#after-agent-callback" rel="noopener noreferrer"&gt;AfterAgentCallback&lt;/a&gt; is a function you attach to an agent that’s guaranteed to run immediately after the agent finishes. It can even modify the agent’s final output.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/kweinmeister/text-to-sql-agent/blob/main/src/texttosql/callbacks.py" rel="noopener noreferrer"&gt;Code Example&lt;/a&gt;: Attaching a Cleanup Callback
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InvocationContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cleanup_sql_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InvocationContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This callback runs *after* the agent and cleans its output.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;raw_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple regex to find content within ```
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;match = re.search(r"```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql\s*(.&lt;em&gt;?)\s&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
```", raw_text, re.DOTALL | re.IGNORECASE)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleaned_text = raw_text
if match:
    cleaned_text = match.group(1)
else:
    # Fallback: simple stripping
    cleaned_text = raw_text.strip().strip("`").strip()

# Add a semicolon if it's missing (another common cleanup)
if not cleaned_text.endswith(";"):
    cleaned_text += ";"

# Return a *new* Content object to *replace* the original output
return Content.from_text(cleaned_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
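&lt;p&gt;Before wiring the callback into an agent, you can sanity-check the extraction logic on its own. Here’s a standalone sketch of the same regex approach (the extract_sql helper name is illustrative, not from the repository):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def extract_sql(raw_text):
    """Strip conversational fluff and markdown fences from model output."""
    match = re.search(r"```sql\s*(.*?)\s*```", raw_text, re.DOTALL | re.IGNORECASE)
    cleaned = match.group(1) if match else raw_text.strip().strip("`").strip()
    # Add a trailing semicolon if it's missing
    if not cleaned.endswith(";"):
        cleaned += ";"
    return cleaned

chatty = "Sure! Here is the SQL query you asked for:\n```sql\nSELECT * FROM users\n```"
print(extract_sql(chatty))  # SELECT * FROM users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;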
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### Final Architecture

We’ve systematically tackled the six hardest problems in Text-to-SQL, evolving a brittle script into an extensible multi-agent system.

Our final root\_agent is a [SequentialAgent](https://google.github.io/adk-docs/api-reference/python/google.adk.agents.html#google.adk.agents.SequentialAgent) that orchestrates a team of specialists: a schema-fetching agent, a looping agent for iterative query improvement (with its own writer and critic), and a fast, deterministic validation agent using sqlglot.

The point is that modern agent development is about _composition_. You have to choose the right ADK construct for the right task. This table is a cheat sheet for making that decision.

### Agent Design: The “Right Tool for the Job”

![](https://cdn-images-1.medium.com/max/1024/1*OuHQ_Jb7kQ0IaBrlWKVICg.png)

### Conclusion: Building Reliable AI Systems

This pattern of **Specialization** , **Orchestration** , and **Safeguards** is the future of building production-ready AI. It’s not just for SQL, either. You can use this same architecture for autonomous code generation, document analysis, and much more.

So stop trying to build one “super-prompt” and start building teams of specialized agents. Welcome to the world of reliable, agentic systems.

What’s next? Get started in 3 simple steps in the [sample repository](https://github.com/kweinmeister/text-to-sql-agent). If you want a hands-on lab exercise, check out [Build Multi-Agent Systems with ADK](https://codelabs.developers.google.com/codelabs/production-ready-ai-with-gc/3-developing-agents/build-a-multi-agent-system-with-adk?hl=en#0&amp;amp;utm_campaign=CDR_0x2b6f3004_default_b459252462&amp;amp;utm_medium=external&amp;amp;utm_source=blog). To learn about powerful, built-in natural language capabilities in AlloyDB, try out the [AlloyDB AI NL SQL](https://codelabs.developers.google.com/alloydb-ai-nl-sql?hl=en#0&amp;amp;utm_campaign=CDR_0x2b6f3004_default_b459252462&amp;amp;utm_medium=external&amp;amp;utm_source=blog) codelab.

Want to keep the discussion going about multi-agent systems? Connect with me on [LinkedIn](https://www.linkedin.com/in/karlweinmeister/), [X](https://x.com/kweinmeister), or [Bluesky](https://bsky.app/profile/kweinmeister.bsky.social).

* * *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>googleadk</category>
      <category>relationaldatabases</category>
      <category>sql</category>
      <category>agents</category>
    </item>
    <item>
      <title>Deploy Faster with Terraform: Your Guide to vLLM on GKE with Infrastructure-as-Code</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Sun, 12 Oct 2025 23:24:44 +0000</pubDate>
      <link>https://dev.to/googleai/deploy-faster-with-terraform-your-guide-to-vllm-on-gke-with-infrastructure-as-code-6jh</link>
      <guid>https://dev.to/googleai/deploy-faster-with-terraform-your-guide-to-vllm-on-gke-with-infrastructure-as-code-6jh</guid>
      <description>&lt;p&gt;Somewhere in your AI journey, you’re going to push the limits of what models can do.&lt;/p&gt;

&lt;p&gt;You might need to squeeze out that extra bit of performance, or try to fit a big model right under a GPU’s VRAM limit. All of these situations require tweaking and redeployment. That’s not as simple as it sounds when the infrastructure includes everything from GPU clusters to storage to networking.&lt;/p&gt;

&lt;p&gt;The solution is to treat your infrastructure the same way you treat your application code. It needs to be versioned in Git. It needs to be tested. And it needs to be deployed through an automated pipeline. This practice, known as &lt;a href="https://cloud.google.com/docs/iac?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Infrastructure as Code&lt;/a&gt;, or IaC, is the foundation of any serious MLOps strategy.&lt;/p&gt;

&lt;p&gt;This article is a practical guide on how to use Terraform for agile ML engineering. I’ll walk through a real-world example of deploying a high-performance inference server with &lt;a href="https://docs.vllm.ai/en/stable/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; on &lt;a href="https://cloud.google.com/kubernetes-engine?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Kubernetes Engine&lt;/a&gt;. You can follow along with the complete source code on GitHub in the &lt;a href="https://github.com/kweinmeister/vllm-gke-terraform" rel="noopener noreferrer"&gt;vllm-gke-terraform&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;We will use the &lt;a href="https://huggingface.co/Qwen/Qwen3-32B" rel="noopener noreferrer"&gt;Qwen3-32B&lt;/a&gt; model in this article, which can be run on easily accessible &lt;a href="https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;NVIDIA L4 GPUs&lt;/a&gt; on Google Cloud. The Terraform script has been tested on larger models, such as &lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507" rel="noopener noreferrer"&gt;Qwen/Qwen3-235B-A22B-Instruct-2507&lt;/a&gt; on a cluster with 8 H100 GPUs.&lt;/p&gt;

&lt;p&gt;The scripts currently use GKE standard clusters for maximum flexibility. For production workloads where you want to offload node management and focus purely on the application, it’s recommended to leverage GKE &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/gke-autopilot-now-available-to-all-qualifying-clusters?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Autopilot&lt;/a&gt; capabilities.&lt;/p&gt;

&lt;h4&gt;
  
  
  Declarative Infrastructure
&lt;/h4&gt;

&lt;p&gt;Terraform uses a declarative language (&lt;a href="https://developer.hashicorp.com/terraform/language/syntax/configuration" rel="noopener noreferrer"&gt;HCL&lt;/a&gt;) where you define the desired end state of your infrastructure. You specify what you need, and Terraform’s engine calculates the necessary API calls to make the real-world infrastructure match that state. Before applying any changes, you can run the terraform plan command to see a detailed preview of what Terraform will create, modify, or destroy.&lt;/p&gt;

&lt;p&gt;This allows for a thorough review to ensure the proposed changes align with your intentions, preventing unintended modifications. This declarative model is the key to eliminating configuration drift and ensuring that every environment is provisioned identically, a critical requirement for reproducible experiments.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs" rel="noopener noreferrer"&gt;Terraform provider for Google Cloud&lt;/a&gt; is the interface between Terraform and Google Cloud. For example, the &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster" rel="noopener noreferrer"&gt;google_container_cluster&lt;/a&gt; resource is used to manage a GKE cluster. You can find the full set of GKE resources &lt;a href="https://cloud.google.com/kubernetes-engine/docs/terraform?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In our project, the &lt;a href="https://github.com/kweinmeister/vllm-gke-terraform/blob/main/gke.tf" rel="noopener noreferrer"&gt;gke.tf&lt;/a&gt; file declares the desired state of a GKE cluster with specific node pools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# gke.tf
resource "google_container_cluster" "qwen_cluster" {
  name = local.cluster_name
  location = var.zone
  project = var.project_id
  # ...
}

resource "google_container_node_pool" "gpu_pools" {
  # ...
  node_config {
    machine_type = each.value.machine_type
    guest_accelerator {
      type = each.value.accelerator_type
      count = each.value.accelerator_count
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To manage this, Terraform maintains a state file that maps these definitions to their real-world resources. For team collaboration, using a remote state backend like &lt;a href="https://cloud.google.com/storage?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Storage&lt;/a&gt; is recommended. It provides a centralized source of truth and uses locking mechanisms to prevent conflicting changes. Here’s how to instruct Terraform to use GCS as its backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.tf
terraform {
  backend "gcs" {
    prefix = "terraform/state/vllm-gke"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Reusable Modules
&lt;/h4&gt;

&lt;p&gt;Terraform modules are the primary mechanism for abstraction and reuse. MLOps teams can create a library of standardized modules for common components like a GKE cluster or a vector database.&lt;/p&gt;

&lt;p&gt;Modules are made reusable through input variables. This allows an engineer to maintain a single, version-controlled set of Terraform files and use variable files (.tfvars) to launch new, isolated deployments.&lt;/p&gt;

&lt;p&gt;To test a new model, you could simply create a new variable file like llama3-test.tfvars. By overriding a few default values, you can spin up an entirely new, isolated environment to test Llama-3-8B on L4 GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# my-experiment.tfvars
project_id = "my-gcp-project"
name_prefix = "my-llama3-deployment"
model_id = "meta-llama/Llama-3-8B-Instruct"
gpu_type = "l4"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running terraform apply -var-file=llama3-test.tfvars makes spinning up parallel experiments a trivial, declarative operation, dramatically increasing a team’s experimental throughput.&lt;/p&gt;

&lt;p&gt;For production systems, this same principle allows for sophisticated, zero-downtime strategies like Blue/Green deployments. A second, parallel “green” version of the entire stack is deployed by instantiating the Terraform configuration with a different set of variables. Once the new environment is fully validated, production traffic can be instantly switched at the load balancer or DNS level. The old “blue” environment can then be decommissioned. By codifying these complex release strategies, the entire deployment process becomes a version-controlled, auditable artifact.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuring the vLLM Engine
&lt;/h4&gt;

&lt;p&gt;Provisioning hardware consistently is the first step. Configuring software to utilize that hardware efficiently is next.&lt;/p&gt;

&lt;p&gt;The sample project uses the popular &lt;a href="https://docs.vllm.ai/en/stable/" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; inference engine. Let’s show how to effectively link Terraform variables to configuration parameters in vLLM.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/kweinmeister/vllm-gke-terraform/blob/main/variables.tf" rel="noopener noreferrer"&gt;variables.tf&lt;/a&gt;, the high-level knobs for experiments are defined:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# variables.tf
variable "gpu_memory_utilization" {
  description = "GPU memory utilization ratio"
  type = number
  default = 0.9
}

variable "max_model_len" {
  description = "The maximum model length."
  type = number
  default = 8192
}

variable "vllm_max_num_seqs" {
  description = "The maximum number of sequences (requests) to batch together."
  type = number
  default = 64
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, the deployment in kubernetes.tf consumes these variables to construct the vLLM server’s startup arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kubernetes.tf
...
container {
  name = "vllm-container"
  args = compact([
    # --- Base Model Arguments ---
    "--model",
    var.model_id,
    "--tensor-parallel-size",
    tostring(local.gpu_config.accelerator_count),

    # --- Performance Tuning from Variables ---
    "--gpu-memory-utilization",
    tostring(var.gpu_memory_utilization),
    "--max-model-len",
    tostring(var.max_model_len),
    "--max-num-seqs",
    tostring(var.vllm_max_num_seqs),
  ])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production-Grade Architecture
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/kweinmeister/vllm-gke-terraform" rel="noopener noreferrer"&gt;sample project&lt;/a&gt; showcases a blueprint for a production-grade inference endpoint on GKE designed for both performance and cost-efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x3aembmosmnkoiypgxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x3aembmosmnkoiypgxx.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/kweinmeister/vllm-gke-terraform/blob/main/gke.tf" rel="noopener noreferrer"&gt;gke.tf&lt;/a&gt; file provisions a GKE cluster with both &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/spot-vms?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;spot&lt;/a&gt; and on-demand GPU node pools, which allows for a flexible and cost-effective approach to managing expensive GPU resources. You can read more &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/running-gke-application-spot-nodes-demand-nodes-fallback?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;here&lt;/a&gt; about the strategy to back up spot VMs with an on-demand node pool.&lt;/p&gt;

&lt;p&gt;To avoid re-downloading large models on every pod restart, a &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/persistent-volumes?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;kubernetes_persistent_volume_claim&lt;/a&gt; is created in &lt;a href="https://github.com/kweinmeister/vllm-gke-terraform/blob/main/kubernetes.tf" rel="noopener noreferrer"&gt;kubernetes.tf&lt;/a&gt; to provide a persistent cache for the Hugging Face models. A Kubernetes Job, defined in &lt;a href="https://github.com/kweinmeister/vllm-gke-terraform/blob/main/kubernetes_jobs.tf" rel="noopener noreferrer"&gt;kubernetes_jobs.tf&lt;/a&gt;, is then used to download the specified model into this persistent volume. This job runs to completion before the main vLLM deployment is scaled up, ensuring the model is ready before the inference server starts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Workflows
&lt;/h3&gt;

&lt;p&gt;While Terraform itself is a big leap forward from shell scripting, it’s crucial that teams don’t stop there. The next step beyond running manual terraform commands is to embrace an automated, end-to-end CI/CD workflow, often called GitOps. The source control repository becomes the single source of truth for both application code and infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljoz3vquz97smpgo3ise.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljoz3vquz97smpgo3ise.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sample project includes a basic GitHub Actions workflow that validates the Terraform code on every push and pull request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/terraform-validate.yml
name: 'Terraform Validate'
on: [push, pull_request]

jobs:
  validate:
    name: 'Terraform Validate'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A complete CI/CD pipeline would extend this by running terraform plan on pull requests to preview changes and automatically running terraform apply on merge to the main branch to deploy them. This creates a flywheel where code is pushed and infrastructure is updated without manual intervention.&lt;/p&gt;
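<p>A minimal sketch of that extension as a separate workflow (the job name, plan-file name, and trigger layout are illustrative, not from the repository; authenticating to Google Cloud, e.g. via Workload Identity Federation, would also be required):<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/terraform-deploy.yml (illustrative)
name: 'Terraform Deploy'
on:
  pull_request:
  push:
    branches: [main]

jobs:
  plan-and-apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      # Preview the changes on every pull request and push
      - run: terraform plan -out=tfplan
      # Apply only on merges to the main branch
      - if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;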

&lt;h3&gt;
  
  
  Infrastructure-as-Code Is Now an AI Competency
&lt;/h3&gt;

&lt;p&gt;The main takeaway is this: mastering &lt;a href="https://cloud.google.com/docs/iac?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Infrastructure as Code&lt;/a&gt; isn’t an optional “DevOps” skill. It’s a core competency for the modern ML engineer. For any organization serious about productionizing AI, &lt;a href="https://cloud.google.com/docs/terraform?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Terraform on Google Cloud&lt;/a&gt; is a key step toward building a scalable engineering culture.&lt;/p&gt;

&lt;p&gt;If you’d like to keep learning more, I recommend the step-by-step guide on using a GKE cluster with Terraform: &lt;a href="https://cloud.google.com/kubernetes-engine/docs/quickstarts/create-cluster-using-terraform?utm_campaign=CDR_0x2b6f3004_user-journey_b450531330&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Quickstart: Deploy a workload with Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;From there, I’d love to hear more about your journey with AI and Cloud infrastructure. Connect on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister" rel="noopener noreferrer"&gt;X&lt;/a&gt;, or &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; to continue the discussion!&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/qXsAJhIlV9E"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




</description>
      <category>vllm</category>
      <category>gke</category>
      <category>terraform</category>
      <category>ai</category>
    </item>
    <item>
      <title>A Developer’s Guide to Model Routing</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Mon, 25 Aug 2025 16:26:04 +0000</pubDate>
      <link>https://dev.to/kweinmeister/a-developers-guide-to-model-routing-85m</link>
      <guid>https://dev.to/kweinmeister/a-developers-guide-to-model-routing-85m</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qdnw2dr0rvhbqfq2ntb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qdnw2dr0rvhbqfq2ntb.jpeg" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not long ago, building with LLMs meant picking one general-purpose model and sticking with it. Today, the landscape is flooded with thousands of options: large and small, open and closed-source, generalist and specialist, each with unique capabilities and costs.&lt;/p&gt;

&lt;p&gt;This explosion of choice has fundamentally changed how we build AI applications. The one-size-fits-all approach is over.&lt;/p&gt;

&lt;p&gt;Instead, we architect systems that select the best model for each task. This is the idea behind model routing. This architectural pattern can be implemented today, and has the potential to change the economics of model inference. Let’s get into it!&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Model Routing
&lt;/h3&gt;

&lt;p&gt;As a developer building with LLMs, you’re constantly juggling three competing priorities: performance, cost, and latency.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance (Quality):&lt;/strong&gt; For complex reasoning and creative generation, you might reach for state-of-the-art models like Google’s &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini 2.5 Pro&lt;/a&gt;. These models deliver high-quality, accurate responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; While premium models deliver state-of-the-art performance, they represent a significant investment. The key to a sustainable AI strategy is to reserve these powerful models for tasks where their advanced capabilities provide a clear return on investment. For more routine queries, smaller, highly efficient models can deliver excellent results at a fraction of the cost. Recent studies show this approach can yield &lt;a href="https://lmsys.org/blog/2024-07-01-routellm/" rel="noopener noreferrer"&gt;cost savings&lt;/a&gt; without significantly degrading performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; In interactive applications like chatbots, a fast response time is critical for a positive user experience. Smaller, specialized models can deliver near-instantaneous responses, making them ideal for real-time, conversational AI. By routing interactive queries to these faster models, you can create a more engaging and responsive application.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Relying on a single model forces an unnecessary compromise. Use a top-tier model for everything, and you pay a premium for power you don’t always need. Use a smaller model for everything, and you sacrifice quality on complex queries. So why are we still forcing ourselves to choose just one?&lt;/p&gt;

&lt;p&gt;Model routing is an architectural pattern designed to solve this optimization problem. It involves maintaining a pool of candidate LLMs and routing each incoming prompt to the most suitable model. That’s often the smallest, fastest, and most cost-effective model that can successfully complete the task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Routing Patterns
&lt;/h3&gt;

&lt;p&gt;Implementing a model router involves choosing an architectural pattern that determines how routing decisions are made. These patterns exist on a spectrum of complexity and intelligence, from simple, predefined rules to sophisticated, AI-driven classification. We will focus on dynamic routing patterns that assess the content, intent, and complexity of the prompt to select the optimal model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule-Based Routing
&lt;/h3&gt;

&lt;p&gt;This is the simplest form of dynamic routing. It uses hard-coded logic, typically a series of if/else statements, to make routing decisions based on simple characteristics of the prompt.&lt;/p&gt;

&lt;p&gt;The rules are based on easily measurable attributes of the prompt, such as the presence of certain keywords, its overall length, or matches against regular expressions. For instance, a system might check for specific terms to identify a task category or measure the prompt’s length to estimate its complexity.&lt;/p&gt;
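&lt;p&gt;For illustration, such heuristics might look like the following sketch. The keyword lists and length threshold are invented for the example; the model names follow the Gemini 2.5 tiers discussed below:&lt;/p&gt;

```python
import re

# Toy rule-based router: keyword and length heuristics only.
# Keyword lists and the 100-word threshold are invented for illustration.
CODE_RE = re.compile(r"\b(code|script|function|debug|refactor)\b", re.I)
SIMPLE_RE = re.compile(r"\b(translate|classify|extract|sentiment)\b", re.I)

def route_by_rules(prompt):
    """Return a model name using simple if/else heuristics."""
    if SIMPLE_RE.search(prompt):
        return "gemini-2.5-flash-lite"   # simple, high-volume task
    if CODE_RE.search(prompt) or len(prompt.split()) > 100:
        return "gemini-2.5-pro"          # likely complex or code-heavy
    return "gemini-2.5-flash"            # sensible default

print(route_by_rules("Classify the sentiment of this review"))  # gemini-2.5-flash-lite
```

&lt;p&gt;Note how easily this breaks: “Don’t classify anything, just chat with me” would still match the classification rule, which is exactly the brittleness described below.&lt;/p&gt;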

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; This approach is predictable, transparent, and fast to execute. It’s an excellent choice for well-defined, simple workflows where task categories can be reliably distinguished by straightforward heuristics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Rule-based systems are brittle and inflexible because they lack a true understanding of language. They can be easily confused by semantic nuance, such as negation or context. The system also becomes difficult to maintain and scale as the number of rules grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LLM-Based Routing
&lt;/h3&gt;

&lt;p&gt;This pattern leverages the intelligence of an LLM to perform the routing task itself. A dedicated, often smaller and faster, “router LLM” acts as a classification engine.&lt;/p&gt;

&lt;p&gt;The user’s prompt is fed into the router LLM. The router LLM is given a prompt that instructs it to analyze the query and classify it into predefined categories. To ensure the output is machine-readable, the router LLM is instructed to respond in a structured format like JSON. The application then parses this JSON output to determine which model to call next.&lt;/p&gt;
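&lt;p&gt;A minimal sketch of this flow is below. A stub stands in for the real router LLM call, and the category names and router prompt are invented for the example:&lt;/p&gt;

```python
import json

ROUTER_PROMPT = (
    "Classify the user query into one of: simple_task, general, "
    "complex_reasoning. Respond only with JSON of the form "
    '{"category": "...", "confidence": 0.0}'
)

# Invented category-to-model mapping for the example.
CATEGORY_TO_MODEL = {
    "simple_task": "gemini-2.5-flash-lite",
    "general": "gemini-2.5-flash",
    "complex_reasoning": "gemini-2.5-pro",
}

def stub_router_llm(prompt):
    # Stand-in for a call to a small, fast router model; a real system
    # would send ROUTER_PROMPT plus the user query over the API.
    return '{"category": "complex_reasoning", "confidence": 0.92}'

def route_with_llm(user_query):
    raw = stub_router_llm(ROUTER_PROMPT + "\n\nQuery: " + user_query)
    decision = json.loads(raw)  # parse the structured router output
    return CATEGORY_TO_MODEL.get(decision["category"], "gemini-2.5-flash")

print(route_with_llm("Prove this algorithm terminates"))  # gemini-2.5-pro
```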

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; This is a powerful and flexible approach. The router LLM can understand complex, ambiguous, and nuanced language. It can handle multi-intent queries and can be adapted to new routing tasks simply by updating its system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; The primary drawback is significant overhead. This method introduces an additional, full LLM API call into the critical path of every request. This adds both cost and latency, which can undermine the very optimization goals the router was intended to achieve.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Semantic Routing
&lt;/h3&gt;

&lt;p&gt;Semantic routing offers a powerful compromise, combining the speed of rule-based systems with the intelligence of LLM-based approaches. It operates on the principle of semantic similarity in vector space and is the core mechanism we’ll implement.&lt;/p&gt;

&lt;p&gt;The process involves four steps. First, routes are defined, each with a name and a list of representative example phrases, or utterances. Next, a text embedding model converts all of these utterances into high-dimensional numerical vectors that capture their semantic meaning, which are then stored in an efficient index. When a new user query arrives, the same embedding model converts it into a vector. Finally, a vector similarity search is performed between the query’s vector and all the utterance vectors in the index, and the route whose utterances are most similar to the query is selected as the winner.&lt;/p&gt;
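&lt;p&gt;The four steps can be sketched end to end. A real router would use a text embedding model such as gemini-embedding-001; here a toy bag-of-words “embedding” keeps the example self-contained, and the routes and utterances are invented:&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words vector. A real implementation
    # would call an embedding model; the similarity logic is the same.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: define routes, each with representative example utterances.
routes = {
    "gemini-2.5-pro": ["write a detailed business plan", "analyze this legal document"],
    "gemini-2.5-flash-lite": ["translate hello to french", "classify this review"],
}

# Step 2: embed every utterance up front and store it in an index.
index = [(name, embed(u)) for name, utts in routes.items() for u in utts]

def semantic_route(query):
    # Steps 3 and 4: embed the query, then pick the route whose
    # utterance is most similar to it.
    qv = embed(query)
    return max(index, key=lambda item: cosine(qv, item[1]))[0]

print(semantic_route("please classify this customer review"))  # gemini-2.5-flash-lite
```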

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; This method is fast, with decision times often measured in milliseconds, because it relies on optimized vector math rather than a slow, generative LLM call. It’s highly scalable to thousands of potential routes and is more robust than simple keyword matching because it understands meaning and context. Modern libraries often allow this configuration to be externalized into declarative files like YAML, separating the routing logic from the application code for better maintainability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; The effectiveness of a semantic router is highly dependent on the quality and comprehensiveness of the example utterances provided for each route. It can also struggle with contextual, multi-turn conversational queries where the user’s intent is not explicitly stated in their most recent message.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice of routing architecture is governed by the “Router Latency Paradox”: a component designed to reduce overall application latency must itself be exceptionally low-latency. An LLM-based router introduces a full inference step to every request, increasing both latency and cost. For this approach to be a net positive, the downstream savings must consistently outweigh its operational overhead, which is a high bar for most interactive applications. Semantic routing, in contrast, replaces this slow inference with a near-instantaneous vector search. This performance difference establishes semantic routing as the default architectural best practice for dynamic, real-time model routing. LLM-based routing is thus reserved for cases where the routing logic is too complex to be captured by semantic similarity alone and the added latency is an acceptable trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Gemini 2.5 Model Family
&lt;/h3&gt;

&lt;p&gt;To build an effective router, you need a solid grasp of the candidate models in your pool. For our implementation, we’ll use Google’s Gemini 2.5 family, a suite of models with a tiered structure of capability and cost that’s perfect for a routing architecture.&lt;/p&gt;

&lt;p&gt;A key innovation across the Gemini 2.5 family is their capability as “thinking models.” This means they can be configured to perform internal reasoning steps, akin to a chain of thought, before generating a final response. This feature, controllable via an API parameter known as the “thinking budget,” can significantly improve performance and accuracy on complex tasks. This controllable reasoning becomes another powerful dimension for our routing logic to consider.&lt;/p&gt;
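&lt;p&gt;One way to wire this into a router is to carry a per-route thinking budget alongside each model. The sketch below is illustrative: the budget values are invented, and the exact request field names should be checked against the Gemini API reference:&lt;/p&gt;

```python
# Illustrative per-route generation settings; budget values are invented.
ROUTE_CONFIG = {
    "gemini-2.5-pro":        {"thinking_budget": None},  # dynamic thinking by default
    "gemini-2.5-flash":      {"thinking_budget": 1024},  # modest reasoning budget
    "gemini-2.5-flash-lite": {"thinking_budget": 0},     # thinking off for max speed
}

def generation_config(route_name):
    """Build the per-request config for the chosen route."""
    cfg = ROUTE_CONFIG[route_name]
    if cfg["thinking_budget"] is None:
        return {}  # let the model decide how much to think
    return {"thinking_config": {"thinking_budget": cfg["thinking_budget"]}}

print(generation_config("gemini-2.5-flash-lite"))
```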

&lt;h3&gt;
  
  
  Gemini 2.5 Pro
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities:&lt;/strong&gt; &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini 2.5 Pro&lt;/a&gt; is Google’s flagship model, engineered for maximum performance and state-of-the-art accuracy. It’s optimized for the most complex and demanding tasks, including deep logical reasoning, advanced code generation, and sophisticated multimodal understanding across text, images, audio, and video.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Router Use Case:&lt;/strong&gt; This is our designated &lt;strong&gt;“strong” model&lt;/strong&gt;. We’ll route only the most challenging queries here: prompts that involve complex problem-solving, novel algorithm design, in-depth analysis of dense technical documents, or multi-step logical puzzles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking:&lt;/strong&gt; For this model, the “thinking” capability is on by default, as it’s integral to its high-end performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gemini 2.5 Flash
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities:&lt;/strong&gt; &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini 2.5 Flash&lt;/a&gt; is designed to be the best model in the family in terms of its price-to-performance ratio. It offers well-rounded, powerful capabilities that approach those of Pro but at a significantly lower operational cost. It also features a controllable thinking budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Router Use Case:&lt;/strong&gt; This is our &lt;strong&gt;“default” or “go-to” model&lt;/strong&gt;. It’s the workhorse that will handle the majority of general-purpose queries. These are tasks that are more complex than simple classification but don’t require the full power (and expense) of Pro. Ideal use cases include general conversation, creative writing, drafting emails, and performing detailed summarizations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gemini 2.5 Flash-Lite
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities:&lt;/strong&gt; As its name suggests, &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini 2.5 Flash-Lite&lt;/a&gt; is the fastest and most cost-efficient model in the 2.5 family. It’s highly optimized for low latency and high-throughput scenarios, making it a cost-effective upgrade from previous generations of Flash models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Router Use Case:&lt;/strong&gt; This is our &lt;strong&gt;fastest model&lt;/strong&gt;. We’ll route simple, high-volume, and latency-sensitive tasks here. It’s perfect for text classification (e.g., sentiment analysis), simple data extraction (e.g., pulling names and dates from text), translation, and answering straightforward factual questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking:&lt;/strong&gt; To maximize its speed and cost-efficiency, “thinking” is turned off by default for Flash-Lite. However, it can be optionally enabled, providing granular control for tasks that might need a small boost in reasoning without escalating to the full Flash model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementing a Semantic Router
&lt;/h3&gt;

&lt;p&gt;With the theory covered, let’s get to the code. This section walks through the &lt;a href="https://github.com/kweinmeister/gemini-model-router" rel="noopener noreferrer"&gt;gemini-model-router&lt;/a&gt; project, which builds a semantic router to intelligently distribute queries among the Gemini 2.5 Pro, Flash, and Flash-Lite models. It uses the open-source &lt;a href="https://github.com/aurelio-labs/semantic-router" rel="noopener noreferrer"&gt;semantic-router&lt;/a&gt; library as its engine and serves it all up with &lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flywxo6ybkf9c2d7czg6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flywxo6ybkf9c2d7czg6b.png" width="800" height="596"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Embeddings are created upfront for each route, and then matched to queries at runtime&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Project Setup
&lt;/h3&gt;

&lt;p&gt;To get started, clone the repository and follow the setup instructions in the &lt;a href="https://github.com/kweinmeister/gemini-model-router/blob/main/README.md" rel="noopener noreferrer"&gt;README.md&lt;/a&gt; file, which covers creating the .env file and installing the required dependencies from requirements.txt.&lt;/p&gt;
&lt;h3&gt;
  
  
  Centralizing Configuration
&lt;/h3&gt;

&lt;p&gt;A key architectural decision in the gemini-model-router project is the separation of configuration from code. All routing logic, including the routes, their representative utterances, and the specific LLM assigned to each route, is defined in a single &lt;a href="https://github.com/kweinmeister/gemini-model-router/blob/main/router.yaml" rel="noopener noreferrer"&gt;router.yaml&lt;/a&gt; file. This makes the system highly maintainable and easy to modify without changing the application’s Python code.&lt;/p&gt;

&lt;p&gt;The router.yaml file has two main sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;encoder&lt;/strong&gt; : Specifies the embedding model to use for converting text to vectors. In this case, it uses Google’s &lt;a href="https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-embedding-001?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;gemini-embedding-001&lt;/a&gt; via the semantic-router’s &lt;a href="https://docs.aurelio.ai/semantic-router/client-reference/encoders/google" rel="noopener noreferrer"&gt;GoogleEncoder&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;routes&lt;/strong&gt; : A list of route definitions. Each route has:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;name: A unique identifier that maps directly to a Gemini model.&lt;/li&gt;
&lt;li&gt;description: A human-readable explanation of the route’s purpose.&lt;/li&gt;
&lt;li&gt;utterances: A list of example phrases that define the semantic space of the route.&lt;/li&gt;
&lt;li&gt;llm: An object specifying the custom class (GoogleLLM), the Python module where it’s defined (main), and the target model ID (e.g., gemini-2.5-pro).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a snippet from the router.yaml file, defining the route for complex queries. A key parameter in the full configuration is the score_threshold. When the router compares a query to its routes, it calculates a similarity score. By setting the threshold to 0.0, we ensure that the router always selects the route with the highest similarity, effectively guaranteeing that a decision is always made.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# router.yaml
encoder_name: gemini-embedding-001
encoder_type: google
routes:
- name: gemini-2.5-pro
  description: For complex, multi-step tasks requiring deep reasoning, code generation, and analysis of large documents.
  utterances:
  - Develop a comprehensive, multi-year business plan for a direct-to-consumer sustainable
    fashion brand, including financial projections and marketing strategies.
  - Write a Python script to perform sentiment analysis on a large CSV of customer
    reviews, generate visualizations, and create a summary report.
  - Compare and contrast the philosophical implications of determinism and free will
    in the context of advanced artificial intelligence, citing relevant academic sources.
  llm:
    module: main
    class: GoogleLLM
    model: gemini-2.5-pro
#... other routes for flash and flash-lite follow...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Routing Logic
&lt;/h3&gt;

&lt;p&gt;The main.py file contains the FastAPI application that serves the router. It includes several key components that work together to bring the YAML configuration to life.&lt;/p&gt;

&lt;h4&gt;
  
  
  The GoogleLLM Wrapper
&lt;/h4&gt;

&lt;p&gt;The semantic-router library requires a compatible LLM object for each route. To integrate with Google’s GenAI SDK, the project defines a custom GoogleLLM class that inherits from &lt;a href="https://docs.aurelio.ai/semantic-router/client-reference/llms/base" rel="noopener noreferrer"&gt;semantic_router.llms.BaseLLM&lt;/a&gt;. This class acts as a bridge, translating the semantic-router’s call signature into an asynchronous request to the Vertex AI Gemini API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py (simplified)
from semantic_router.llms import BaseLLM
from google import genai

class GoogleLLM(BaseLLM):
    _client: ClassVar[Optional[genai.Client]] = None

    @classmethod
    def get_client(cls) -&amp;gt; genai.Client:
        if cls._client is None:
            project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
            cls._client = genai.Client(vertexai=True, project=project_id)
        return cls._client

    async def __acall__ (self, messages: List[Message], **kwargs) -&amp;gt; Optional[str]:
        contents = kwargs.get("multimodal_contents", messages[0].content)
        config = kwargs.get("config", self.kwargs.get("config", {}))

        response = await self.get_client().aio.models.generate_content(
            model=self.name,
            contents=contents,
            **config,
        )
        return response.text if response else ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The /query Endpoint
&lt;/h4&gt;

&lt;p&gt;The main API endpoint uses a series of helper functions to route and execute the query. The handle_query function orchestrates the process: it extracts text for routing, determines the best route, and executes the LLM call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py (simplified)
@app.post("/query", response_model=RouterResponse)
async def handle_query(request: QueryRequest, fastapi_request: Request):
    router = fastapi_request.app.state.router
    default_route = fastapi_request.app.state.default_route_name

    # 1. Extract text and determine the route
    text_for_routing = _get_text_for_routing(request.contents)
    route_choice = _determine_route(router, text_for_routing, default_route)
    chosen_route = router.get(route_choice.name)

    # 2. Execute the call using the LLM from the chosen route
    model_response = await _execute_llm_call(
        chosen_route, request.contents, request.config, text_for_routing
    )

    return RouterResponse(
        route_name=chosen_route.name, model_response=model_response
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying to Production
&lt;/h3&gt;

&lt;p&gt;While FastAPI’s web server &lt;a href="https://www.uvicorn.org/" rel="noopener noreferrer"&gt;uvicorn&lt;/a&gt; is perfect for local development, a production deployment requires a robust, scalable hosting environment. &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; is an ideal choice for this service because it’s a fully managed, serverless platform that takes your containerized application (including the Uvicorn server) and handles all the underlying infrastructure, scaling, and request management.&lt;/p&gt;

&lt;p&gt;To deploy the router, you first need to have the Google Cloud SDK installed and configured. Then, you can deploy the service with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud run deploy gemini-model-router \
  --source . \
  --region us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command builds a container from your source code, pushes it to the Artifact Registry, and deploys it as a public-facing service. Cloud Run handles all the infrastructure, so you can focus on the application logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Best Practices
&lt;/h3&gt;

&lt;p&gt;Deploying a model router to production requires building an observable and resilient system. An API management platform like Google Cloud’s &lt;a href="https://cloud.google.com/apigee/api-management?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Apigee&lt;/a&gt; can serve as a unified and secure gateway to your model routing service. It can provide essential capabilities like enforcing security policies, managing traffic with rate limiting and quotas, and offering deep visibility through analytics and monitoring. Let’s review the key principles needed to move beyond a proof-of-concept.&lt;/p&gt;

&lt;p&gt;First, treat the router as a mission-critical, standalone service. Because it can be a single point of failure and a performance bottleneck, it must be independently scalable and fault-tolerant. Containerize the router and deploy it on a platform like &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt; to ensure high availability, allowing it to scale independently of the applications that consume it.&lt;/p&gt;

&lt;p&gt;Second, you cannot optimize what you cannot measure. Implement comprehensive logging and monitoring for every routing decision. For each request, log the chosen route, similarity score, final model, latency, and estimated cost. This data can be fed into &lt;a href="https://cloud.google.com/observability?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud’s observability suite&lt;/a&gt; to create dashboards for tracking key performance indicators like route distribution, cost per query, and P99 latency. This allows you to set up alerts for anomalies, such as a sudden shift in routing patterns or an increase in fallback rates.&lt;/p&gt;
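&lt;p&gt;As a sketch, each routing decision could be emitted as one structured JSON log line, which Cloud Logging parses into filterable fields. The schema below is an example, not a fixed format:&lt;/p&gt;

```python
import json
import time

def log_routing_decision(query, route, score, latency_ms, est_cost_usd):
    # One structured JSON line per request; field names are an example
    # schema for dashboards and alerts, not a required format.
    entry = {
        "event": "routing_decision",
        "timestamp": time.time(),
        "route": route,
        "similarity_score": round(score, 4),
        "latency_ms": latency_ms,
        "estimated_cost_usd": est_cost_usd,
        "query_chars": len(query),  # log size, not content, to limit PII exposure
    }
    print(json.dumps(entry))
    return entry

log_routing_decision("Translate hello", "gemini-2.5-flash-lite", 0.8123, 42, 0.00001)
```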

&lt;p&gt;Third, the initial configuration is just a starting point. True optimization requires a data-driven feedback loop. Collect and review production queries to identify misrouted requests, and use this analysis to refine your route utterances. A/B testing frameworks are invaluable for comparing different routing strategies or model configurations in a live environment to validate improvements.&lt;/p&gt;

&lt;p&gt;Finally, enterprise-grade reliability requires planning for failure. Implement a chain of fallbacks that goes beyond a simple default route. For instance, if a request to gemini-2.5-pro fails, the system should automatically retry with exponential backoff. If that also fails, it should fall back to the next best model, gemini-2.5-flash.&lt;/p&gt;
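&lt;p&gt;A minimal sketch of such a fallback chain, with a stub standing in for the real API call and invented retry parameters:&lt;/p&gt;

```python
import time

# Try the strongest model first, then fall back down the chain.
FALLBACK_CHAIN = ["gemini-2.5-pro", "gemini-2.5-flash", "gemini-2.5-flash-lite"]

def call_with_fallbacks(prompt, call_model, retries=3, base_delay=0.5):
    """Retry each model with exponential backoff before falling back."""
    for model in FALLBACK_CHAIN:
        for attempt in range(retries):
            try:
                return model, call_model(model, prompt)
            except Exception:
                time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("all models in the fallback chain failed")

# Demo with a stub in place of the real API: pro is "down", flash succeeds.
def flaky_backend(model, prompt):
    if model == "gemini-2.5-pro":
        raise TimeoutError("simulated outage")
    return "ok from " + model

model, answer = call_with_fallbacks("hello", flaky_backend, base_delay=0.01)
print(model)  # gemini-2.5-flash
```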

&lt;h3&gt;
  
  
  The Future of Model Routing
&lt;/h3&gt;

&lt;p&gt;There is a broader trend towards more modular and dynamic AI architectures, and model routing is no exception. The future of model routing could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal Routing:&lt;/strong&gt; The next logical step is routing on more than just text. The current router simplifies the problem by extracting the text from a multimodal prompt, but the concept of vector similarity works for any modality you can embed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical Routing:&lt;/strong&gt; The concept of system-level model routing is a macro-scale analog of what &lt;a href="https://huggingface.co/blog/moe" rel="noopener noreferrer"&gt;Mixture-of-Experts&lt;/a&gt; or MoE architectures do within a single neural network. In an MoE model, an internal “router” network dynamically selects which “expert” sub-networks should process each token of an input sequence. Our external router does the same thing, but its “experts” are entire, independent LLMs. Future systems may employ hierarchical routing, where a top-level semantic router first selects the best specialized MoE model for a task, which then performs its own fine-grained, internal routing to process the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, model routing is a foundational building block for the next generation of complex, multi-agent AI systems. As we’ve shown, the combination of a powerful model family like &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google’s Gemini 2.5&lt;/a&gt;, a serverless platform like &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x2b6f3004_user-journey_b440933914&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;, and the open-source &lt;a href="https://github.com/kweinmeister/gemini-model-router" rel="noopener noreferrer"&gt;gemini-model-router&lt;/a&gt; project makes this advanced architecture an achievable engineering task. The tools are here. The patterns are clear.&lt;/p&gt;

&lt;p&gt;It’s time to start building. Share what you’ve built with me on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister" rel="noopener noreferrer"&gt;X&lt;/a&gt;, or &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;!&lt;/p&gt;




</description>
      <category>largelanguagemodels</category>
      <category>googlecloudrun</category>
      <category>routing</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Mastering Agentic Development with Gemini and Roo Code</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Sun, 20 Jul 2025 04:56:25 +0000</pubDate>
      <link>https://dev.to/kweinmeister/mastering-agentic-development-with-gemini-and-roo-code-4j64</link>
      <guid>https://dev.to/kweinmeister/mastering-agentic-development-with-gemini-and-roo-code-4j64</guid>
      <description>&lt;p&gt;The conversation around AI in software development has matured beyond the “AI as a chatbot” and into sophisticated AI agents. We’re moving toward building a living blueprint that can reason about your code in its entirety and evolve with it over time.&lt;/p&gt;

&lt;p&gt;For developers who want a powerful, all-in-one AI experience, Google’s &lt;a href="https://cloud.google.com/gemini/docs/codeassist/overview?utm_campaign=CDR_0x2b6f3004_user-journey_b431570178&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini Code Assist&lt;/a&gt; is a fantastic solution that provides a seamless, out-of-the-box experience, bringing the power of Gemini directly into your workflow.&lt;/p&gt;

&lt;p&gt;For those who love to assemble best-in-class technologies from the open ecosystem, this article is for you. We will explore a production-ready stack for those who want a customized and self-hosted solution. This stack combines the &lt;a href="https://roocode.com/" rel="noopener noreferrer"&gt;Roo Code&lt;/a&gt; VS Code extension, powered by Google’s underlying &lt;a href="https://ai.google.dev/?utm_campaign=CDR_0x2b6f3004_default_b431570178&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini models&lt;/a&gt;, and takes it to the next level with a self-hosted &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; vector database on &lt;a href="https://cloud.google.com/kubernetes-engine?utm_campaign=CDR_0x2b6f3004_user-journey_b431570178&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Kubernetes Engine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F944%2F0%2AT6PRKm-ZJFVlSYr6" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F944%2F0%2AT6PRKm-ZJFVlSYr6"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Solution architecture for agentic development with Roo Code, Gemini, and Qdrant&lt;/em&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  Solution Components
&lt;/h3&gt;

&lt;p&gt;Roo Code is a VS Code extension that can be thought of as an “AI Dev Team” with modes ranging from Architect to Debug. You can give it a high-level task, like “refactor this module to use the new logging service,” and it will create a plan, identify the necessary code changes, and execute them across multiple files. For a deeper dive, check out the &lt;a href="https://docs.roocode.com/" rel="noopener noreferrer"&gt;Roo Code documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F419%2F0%2ALlT1pB1La4zfEaTX" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F419%2F0%2ALlT1pB1La4zfEaTX"&gt;&lt;/a&gt;&lt;/p&gt;
Using Roo Code to update a project README based on the current codebase



&lt;p&gt;You can take full advantage of Roo Code’s capabilities with the massive context window available in Gemini models. This allows Roo Code to hold a vast amount of code in its “short-term memory,” enabling it to understand the intricate relationships between files and modules and to generate code that is consistent with the entire project. You can learn more about the Gemini API in the &lt;a href="https://ai.google.dev/gemini-api/docs?utm_campaign=CDR_0x2b6f3004_default_b431570178&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To make the use of this large context window efficient, Roo Code leverages prompt caching, a feature &lt;a href="https://x.com/roo_code/status/1915590059291811873" rel="noopener noreferrer"&gt;now available&lt;/a&gt; in Gemini models. When Roo Code sends the initial instructions and context to the model, Gemini generates an internal representation and returns a cache reference. On subsequent requests, Roo Code can send this cache reference instead of the full prompt, dramatically reducing token usage and improving latency. This makes the system both cost-effective and performant.&lt;/p&gt;
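&lt;p&gt;The mechanism can be pictured as a lookup table keyed by a hash of the shared prefix. The following is a conceptual simulation only, not the real Gemini API; all names here are made up for illustration:&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Conceptual simulation of prompt caching (not the real Gemini API):
// the first request stores the large shared prefix and returns a compact
// handle; follow-up requests send only the handle plus the short new turn.
struct PromptCache {
    entries: HashMap<u64, String>,
}

impl PromptCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Store the prefix once and hand back a compact reference to it.
    fn store(&mut self, prefix: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        prefix.hash(&mut hasher);
        let handle = hasher.finish();
        self.entries.insert(handle, prefix.to_string());
        handle
    }

    // Characters "sent" on a follow-up: only the new turn, not the prefix.
    fn request_len(&self, handle: u64, new_turn: &str) -> usize {
        assert!(self.entries.contains_key(&handle), "unknown cache handle");
        new_turn.len()
    }
}

fn main() {
    let mut cache = PromptCache::new();
    let prefix = "system instructions + full project context ".repeat(100);
    let handle = cache.store(&prefix);
    // The follow-up request is tiny compared with re-sending the prefix.
    println!("{} vs {}", cache.request_len(handle, "fix the bug"), prefix.len());
}
```

&lt;p&gt;The real service manages cache lifetimes and billing for cached tokens; the sketch only shows why sending a reference beats resending the full context.&lt;/p&gt;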

&lt;p&gt;For codebase indexing, Roo Code supports Gemini’s state-of-the-art gemini-embedding-001 &lt;a href="https://deepmind.google/research/publications/157741/" rel="noopener noreferrer"&gt;embedding model&lt;/a&gt;. This is crucial for the accuracy of semantic search, and you can find more information on Gemini’s &lt;a href="https://ai.google.dev/gemini-api/docs/embeddings?utm_campaign=CDR_0x2b6f3004_default_b431570178&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;embedding models here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Gemini Models in Roo Code
&lt;/h3&gt;

&lt;p&gt;The connection between Roo Code and a model is what enables its agentic capabilities: planning, executing commands, and writing code across your entire project. You can connect to Gemini’s models through the Gemini API or through Google Cloud’s Vertex AI.&lt;/p&gt;

&lt;p&gt;To use the Gemini API, you simply create an API key in &lt;a href="https://aistudio.google.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Google AI Studio&lt;/strong&gt;&lt;/a&gt;, then in Roo Code’s settings, select the &lt;strong&gt;Google Gemini&lt;/strong&gt; provider, paste your key, and choose a model. For detailed, step-by-step instructions on this process, refer to the &lt;a href="https://docs.roocode.com/providers/gemini" rel="noopener noreferrer"&gt;Roo Code documentation for the Gemini provider&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For teams and enterprises using Google Cloud, connecting via &lt;strong&gt;Vertex AI&lt;/strong&gt; provides unified billing, IAM permissions, and more. You will create a service account with the “Vertex AI User” role in the Google Cloud Console and download its JSON key file. Within Roo Code’s settings, select the &lt;strong&gt;GCP Vertex AI&lt;/strong&gt; provider, provide the credentials from your JSON key, and enter your Project ID and Region. The &lt;a href="https://docs.roocode.com/providers/vertex" rel="noopener noreferrer"&gt;Roo Code documentation for Vertex AI&lt;/a&gt; provides a complete walkthrough of this setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F365%2F0%2A-vyV1YXVRzrlsUcU" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F365%2F0%2A-vyV1YXVRzrlsUcU"&gt;&lt;/a&gt;&lt;/p&gt;
The Vertex AI LLM provider for Gemini in Roo Code



&lt;p&gt;For both connection methods, we recommend starting with &lt;strong&gt;gemini-2.5-pro&lt;/strong&gt; for the best experience. Its powerful reasoning capabilities and large context window are ideal for complex, multi-step tasks. For faster, more cost-effective use, &lt;strong&gt;gemini-2.5-flash&lt;/strong&gt; is an excellent alternative.&lt;/p&gt;

&lt;p&gt;With Roo Code’s reasoning engine now powered by Gemini, the next step is to give it a persistent, long-term memory of your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codebase Indexing
&lt;/h3&gt;

&lt;p&gt;Codebase indexing creates a semantic “long-term memory” of your code that the agent can access at any time. This is a multi-stage process that transforms your source code into a searchable knowledge base.&lt;/p&gt;

&lt;h4&gt;
  
  
  Intelligent Chunking
&lt;/h4&gt;

&lt;p&gt;First, Roo Code uses &lt;a href="https://tree-sitter.github.io/tree-sitter/" rel="noopener noreferrer"&gt;Tree-sitter&lt;/a&gt; to parse your code into an Abstract Syntax Tree (AST). This gives it a deep, structural understanding of your code, just like a compiler does. Instead of arbitrarily splitting a file every few hundred lines, the AST is used to intelligently chunk the code into complete, semantic blocks.&lt;/p&gt;

&lt;p&gt;This “semantic chunking” means the pieces of code being indexed are meaningful and self-contained units, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A complete function or method.&lt;/li&gt;
&lt;li&gt;An entire class or struct definition.&lt;/li&gt;
&lt;li&gt;A specific configuration block.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that the context isn’t lost by splitting a function in half. For unsupported languages, Roo Code falls back to line-based chunking.&lt;/p&gt;
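&lt;p&gt;The fallback path is easy to picture. Here is a minimal line-based chunker; the chunk size is arbitrary and not Roo Code’s actual setting:&lt;/p&gt;

```rust
// Illustrative line-based chunking fallback: split a file into fixed-size
// windows of lines. The chunk size is arbitrary, not Roo Code's setting.
fn chunk_by_lines(source: &str, lines_per_chunk: usize) -> Vec<String> {
    let lines: Vec<&str> = source.lines().collect();
    lines
        .chunks(lines_per_chunk)
        .map(|window| window.join("\n"))
        .collect()
}

fn main() {
    let file = (1..=10)
        .map(|i| format!("line {i}"))
        .collect::<Vec<_>>()
        .join("\n");
    let chunks = chunk_by_lines(&file, 4);
    assert_eq!(chunks.len(), 3); // 4 + 4 + 2 lines
}
```

&lt;p&gt;Unlike the AST-based path, a window boundary here can land in the middle of a function, which is exactly the context loss semantic chunking avoids.&lt;/p&gt;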

&lt;h3&gt;
  
  
  Generating Embeddings
&lt;/h3&gt;

&lt;p&gt;Once the code is broken down into these intelligent chunks, the next step is to capture their semantic meaning in a way a machine can understand. This is where Gemini’s gemini-embedding-001 model comes in.&lt;/p&gt;

&lt;p&gt;Each semantic chunk produced by Tree-sitter is fed into the embedding model, which outputs a high-dimensional numerical vector. This vector is the &lt;strong&gt;embedding&lt;/strong&gt;  — a mathematical representation of the code’s meaning. The Gemini embedding model captures fine details with 3072 dimensions in every embedding. For a deeper dive into &lt;a href="https://arxiv.org/pdf/2205.13147" rel="noopener noreferrer"&gt;Matryoshka Representation Learning&lt;/a&gt;, a technique used to train the model, see this video:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/VQosEgOw84s"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;
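&lt;p&gt;The core idea of Matryoshka Representation Learning can be sketched in a few lines: the leading dimensions of the full vector form a usable smaller embedding, so you can truncate and renormalize to trade quality for storage and speed. A minimal illustration:&lt;/p&gt;

```rust
// Sketch of the Matryoshka idea: the leading dimensions of a full embedding
// form a usable smaller embedding, so you can truncate and renormalize.
fn truncate_and_normalize(embedding: &[f64], dims: usize) -> Vec<f64> {
    let prefix = &embedding[..dims.min(embedding.len())];
    let norm = prefix.iter().map(|x| x * x).sum::<f64>().sqrt();
    prefix.iter().map(|x| x / norm).collect()
}

fn main() {
    // Pretend this is a 4-dimensional embedding; keep only 2 dimensions.
    let full = vec![3.0, 4.0, 0.5, 0.5];
    let small = truncate_and_normalize(&full, 2);
    let norm: f64 = small.iter().map(|x| x * x).sum::<f64>().sqrt();
    assert!((norm - 1.0).abs() < 1e-9); // truncated vector is unit length again
}
```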

&lt;h3&gt;
  
  
  Storing and Searching Embeddings
&lt;/h3&gt;

&lt;p&gt;With the codebase converted into a collection of semantically rich embeddings, those embeddings need to be stored and searched efficiently. Roo Code uses Qdrant, a high-performance vector database, for this purpose.&lt;/p&gt;

&lt;p&gt;When you ask a question, Roo Code’s search tool follows this process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query:&lt;/strong&gt; Your natural language query (e.g., “where is our user authentication logic?”) is sent to the Gemini embedding model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vectorize:&lt;/strong&gt; The model converts your query into an embedding vector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search:&lt;/strong&gt; Roo Code performs a vector search in the Qdrant database, looking for the code chunk embeddings that are most similar (i.e., closest in vector space) to your query’s embedding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve:&lt;/strong&gt; The tool then returns the most relevant code snippets, along with their file paths and similarity scores.&lt;/li&gt;
&lt;/ol&gt;
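&lt;p&gt;Steps 2 through 4 boil down to cosine similarity and a sort. A brute-force sketch of the same retrieval logic (Qdrant does this at scale with approximate nearest-neighbor indexes; the file names and tiny vectors below are invented):&lt;/p&gt;

```rust
// Brute-force version of steps 2 through 4: score every chunk embedding by
// cosine similarity against the query embedding and return the top matches.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

fn top_k<'a>(
    query: &[f64],
    chunks: &'a [(&'a str, Vec<f64>)],
    k: usize,
) -> Vec<(&'a str, f64)> {
    let mut scored: Vec<(&'a str, f64)> = chunks
        .iter()
        .map(|(path, emb)| (*path, cosine(query, emb)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let chunks = vec![
        ("src/auth.rs", vec![0.9, 0.1]),
        ("src/billing.rs", vec![0.1, 0.9]),
    ];
    let query = vec![1.0, 0.0]; // embedding of "where is our user authentication logic?"
    let hits = top_k(&query, &chunks, 1);
    assert_eq!(hits[0].0, "src/auth.rs");
}
```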

&lt;p&gt;Roo Code also provides a user-friendly interface for configuring the codebase indexer. You can easily select your embedding provider, enter your API keys, and specify the Qdrant URL. The advanced configuration options allow you to fine-tune the search behavior by adjusting the Search Score Threshold and Maximum Search Results. You can also specify which files to ignore by adding patterns to a .rooignore file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F413%2F0%2AgE-jXByfgTOYcx-8" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F413%2F0%2AgE-jXByfgTOYcx-8"&gt;&lt;/a&gt;&lt;/p&gt;
Indexing a codebase in Roo Code



&lt;h3&gt;
  
  
  From Local to Centralized Indexing
&lt;/h3&gt;

&lt;p&gt;The easiest way to get started is with a local Qdrant instance. As the official &lt;a href="https://qdrant.tech/documentation/quickstart/" rel="noopener noreferrer"&gt;Qdrant Quickstart&lt;/a&gt; shows, you can be up and running in minutes with a single Docker command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage:z" qdrant/qdrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an individual developer, this is a fantastic way to get all the benefits of codebase indexing without any external dependencies.&lt;/p&gt;

&lt;p&gt;As your team grows, managing dozens of individual Docker instances can become cumbersome. This is where a centralized Qdrant instance provides value — not as a single, conflict-prone shared index, but as a managed, cost-effective platform to host a &lt;em&gt;fleet&lt;/em&gt; of personal indexes.&lt;/p&gt;

&lt;p&gt;Google Kubernetes Engine, or GKE, is an excellent choice for this, offering high availability and enterprise-grade security. The principle is the same regardless of the platform: provide a robust, central service to host many isolated environments. You can deploy the infrastructure within minutes using the &lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/deploy-qdrant?utm_campaign=CDR_0x2b6f3004_user-journey_b431570178&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;GKE tutorial for deploying Qdrant&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Using the instructions in the tutorial, you can easily access it from your local system using &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/" rel="noopener noreferrer"&gt;port forwarding&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROJECT_ID="your-project-id"
REGION="us-central1"

gcloud container clusters get-credentials qdrant-cluster --region "$REGION" --project "$PROJECT_ID"

kubectl port-forward service/qdrant 6333:6333
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roo Code generates a unique Qdrant collection name by hashing the absolute local workspace path. This means that even when using a central Qdrant instance, each developer’s index is completely isolated. To avoid conflicts, each developer needs to ensure they are using a different path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developer A: /Users/alice/projects/my-app&lt;/li&gt;
&lt;li&gt;Developer B: /Users/bob/projects/my-app&lt;/li&gt;
&lt;/ul&gt;
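&lt;p&gt;The isolation scheme can be illustrated in a few lines of Rust. The hash function here is the standard library’s DefaultHasher and the &lt;code&gt;ws-&lt;/code&gt; naming is invented; Roo Code’s actual scheme may differ:&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Deriving an isolated collection name from the absolute workspace path.
// DefaultHasher and the "ws-" prefix are illustrative, not Roo Code's actual scheme.
fn collection_name(workspace_path: &str) -> String {
    let mut hasher = DefaultHasher::new();
    workspace_path.hash(&mut hasher);
    format!("ws-{:016x}", hasher.finish())
}

fn main() {
    let alice = collection_name("/Users/alice/projects/my-app");
    let bob = collection_name("/Users/bob/projects/my-app");
    // Same repo name, different absolute paths: different collections, no conflicts.
    assert_ne!(alice, bob);
}
```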

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The future of AI-assisted development is about choice. Whether you prefer a powerful, all-in-one solution like Google’s &lt;a href="https://cloud.google.com/gemini/docs/codeassist/overview?utm_campaign=CDR_0x2b6f3004_user-journey_b431570178&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Gemini Code Assist&lt;/a&gt; for a seamless, integrated experience, or the composable stack detailed in this article, the goal is the same: to create a truly intelligent development environment.&lt;/p&gt;

&lt;p&gt;What will you build with Gemini and Roo Code? Feel free to continue the discussion on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>roocode</category>
      <category>googlegemini</category>
      <category>embedding</category>
      <category>aicodingassistant</category>
    </item>
    <item>
      <title>Getting started with Rust on Google Cloud</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Thu, 27 Mar 2025 04:29:49 +0000</pubDate>
      <link>https://dev.to/googlecloud/getting-started-with-rust-on-google-cloud-4hln</link>
      <guid>https://dev.to/googlecloud/getting-started-with-rust-on-google-cloud-4hln</guid>
      <description>&lt;p&gt;This post will guide you through deploying a simple “Hello, World!” application on &lt;a href="https://cloud.google.com/run?utm_campaign=CDR_0x2b6f3004_default_b403810548&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;. You’ll then extend the application to integrate with Google Cloud services using the experimental &lt;a href="https://github.com/googleapis/google-cloud-rust" rel="noopener noreferrer"&gt;Rust client libraries&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I’ll cover the necessary code, Dockerfile configuration, and deployment steps. I’ll also recommend a robust and scalable stack for building web services, especially when combined with Google Cloud’s serverless platform, Cloud Run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AX_eDJ5lRKkKc64Ut" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AX_eDJ5lRKkKc64Ut" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Rust and Axum?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.rust-lang.org/" rel="noopener noreferrer"&gt;Rust&lt;/a&gt; has gained significant traction in backend development, earning the title of &lt;a href="https://survey.stackoverflow.co/2024/technology#2-programming-scripting-and-markup-languages" rel="noopener noreferrer"&gt;most-admired language&lt;/a&gt; in the StackOverflow 2024 Developer Survey. This popularity stems from its core strengths: performance, memory safety, and reliability. Rust’s low-level control and zero-cost abstractions enable &lt;a href="https://nnethercote.github.io/perf-book/title-page.html" rel="noopener noreferrer"&gt;highly performant&lt;/a&gt; applications. Its &lt;a href="https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html" rel="noopener noreferrer"&gt;ownership system&lt;/a&gt; prevents common programming errors like data races and null pointer dereferences. In addition, Rust’s strong &lt;a href="https://doc.rust-lang.org/reference/type-system.html" rel="noopener noreferrer"&gt;type system&lt;/a&gt; and compile-time checks catch errors early in the development process, leading to more reliable software.&lt;/p&gt;

&lt;p&gt;The Rust web framework ecosystem is vibrant and evolving. Popular choices include &lt;a href="https://github.com/tokio-rs/axum" rel="noopener noreferrer"&gt;Axum&lt;/a&gt;, &lt;a href="https://rocket.rs/" rel="noopener noreferrer"&gt;Rocket&lt;/a&gt;, and &lt;a href="https://github.com/actix/actix-web" rel="noopener noreferrer"&gt;Actix&lt;/a&gt;. In this post, I’ll showcase &lt;a href="https://github.com/tokio-rs/axum" rel="noopener noreferrer"&gt;Axum&lt;/a&gt;, but you can apply what you’ve learned here to other Rust web frameworks. Axum’s API is clear and composable, making it easy to build web services. Its modular architecture allows developers to select only the necessary components. Axum is built on &lt;a href="https://tokio.rs/" rel="noopener noreferrer"&gt;Tokio&lt;/a&gt;, a popular asynchronous runtime for Rust, which allows it to handle concurrency and I/O operations efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hello World Application
&lt;/h3&gt;

&lt;p&gt;Let’s start by exploring a basic “Hello, World!” &lt;a href="https://github.com/tokio-rs/axum/tree/main/examples/hello-world" rel="noopener noreferrer"&gt;example&lt;/a&gt; from the official Axum repository. In each section of this blog post, you will enhance the example to leverage Google Cloud capabilities. You can access the final code sample in the &lt;a href="https://github.com/kweinmeister/cloud-rust-example" rel="noopener noreferrer"&gt;cloud-rust-example&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;First, the &lt;a href="https://github.com/tokio-rs/axum/blob/main/examples/hello-world/Cargo.toml" rel="noopener noreferrer"&gt;Cargo.toml&lt;/a&gt; manifest file defines the project’s metadata and dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[package]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"example-hello-world"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1.0"&lt;/span&gt;
&lt;span class="py"&gt;edition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"2021"&lt;/span&gt;
&lt;span class="py"&gt;publish&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;axum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"../../axum"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;tokio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"full"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within this file, you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;[package]&lt;/code&gt;: Contains basic project information like name, version, and the Rust edition. &lt;code&gt;publish = false&lt;/code&gt; prevents accidental publication.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[dependencies]&lt;/code&gt;: Lists the project’s dependencies — &lt;code&gt;axum&lt;/code&gt; for the web framework and &lt;code&gt;tokio&lt;/code&gt; for asynchronous capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s examine the core application code, &lt;a href="https://github.com/tokio-rs/axum/blob/main/examples/hello-world/src/main.rs" rel="noopener noreferrer"&gt;src/main.rs&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;response&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;routing&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// build our application with a route&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// run it&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1:3000"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;
        &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"listening on {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="nf"&gt;.local_addr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Html&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;'static&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;h1&amp;gt;Hello, World!&amp;lt;/h1&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code sets up a minimal web server using Axum and Tokio. The &lt;code&gt;#[tokio::main]&lt;/code&gt; macro enables asynchronous execution. The &lt;code&gt;main&lt;/code&gt; function creates a &lt;code&gt;Router&lt;/code&gt; to handle requests, defines a single route &lt;code&gt;/&lt;/code&gt; that responds with “Hello, World!”, binds the server to &lt;code&gt;127.0.0.1:3000&lt;/code&gt;, and starts the server. The &lt;code&gt;handler&lt;/code&gt; function generates the HTML response for the root route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhancements for Cloud Run
&lt;/h3&gt;

&lt;p&gt;The basic example above works well for local development, but let’s make some improvements for deploying to Cloud Run. The official example notably does &lt;em&gt;not&lt;/em&gt; include a Dockerfile, which is required for Cloud Run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Standalone Deployment:&lt;/strong&gt; To make the example standalone and deployable, modify the Cargo.toml file. Change the axum dependency from &lt;code&gt;axum = { path = "../../axum" }&lt;/code&gt; to &lt;code&gt;axum = "0.8"&lt;/code&gt; to use the published version of Axum from &lt;a href="http://crates.io" rel="noopener noreferrer"&gt;crates.io&lt;/a&gt; instead of the local path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dynamic Port Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud Run dynamically assigns a port to your application, which is provided through the &lt;code&gt;PORT&lt;/code&gt; environment variable. The original example hardcodes the port to 3000. To make our application Cloud Run-compatible, modify the main function to read the &lt;code&gt;PORT&lt;/code&gt; environment variable and use it if available, falling back to a default port such as 8080 if the variable is not set.&lt;/p&gt;

&lt;p&gt;The address should also be changed to 0.0.0.0 to listen on all network interfaces, which is generally preferred for containerized applications.&lt;/p&gt;

&lt;p&gt;Here’s the modified main function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Get the port from the environment, defaulting to 8080&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PORT"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_else&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="s"&gt;"8080"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0.0.0.0:{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// build our application with a route&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// run it&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;
        &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"listening on {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="nf"&gt;.local_addr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Dockerfile:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To deploy to Cloud Run, you’ll need a Dockerfile. Here’s a simple one that works well for this example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; rust:1.85.1&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["./target/release/example-hello-world"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Dockerfile uses the official &lt;a href="https://hub.docker.com/_/rust" rel="noopener noreferrer"&gt;Rust image&lt;/a&gt; as a base, copies the project files, builds the application in release mode, exposes port 8080 (&lt;a href="https://cloud.google.com/run/docs/container-contract#port?utm_campaign=CDR_default_0x80ca756c&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;the default port&lt;/a&gt;), and sets the command to run the compiled executable. You can upgrade to the latest Rust image if you’d like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. .gcloudignore file:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can also add a .gcloudignore file to the project root to exclude unnecessary files (like the target directory containing build artifacts) from the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.git/
.gitignore
target/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying to Cloud Run
&lt;/h3&gt;

&lt;p&gt;Before deploying, ensure you have the &lt;a href="https://cloud.google.com/sdk/docs/install-sdk?utm_campaign=CDR_0x2b6f3004_default_b403810548&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Google Cloud SDK&lt;/a&gt; installed and configured, and you have &lt;a href="https://console.cloud.google.com/flows/enableapi?apiid=run.googleapis.com&amp;amp;utm_campaign=CDR_0x2b6f3004_default_b403810548&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;enabled the Cloud Run API&lt;/a&gt; in your Google Cloud project. You’ll also need to be in the root directory of your Axum project (where the Cargo.toml file is located).&lt;/p&gt;

&lt;p&gt;Before attempting your deployment, you can &lt;a href="https://doc.rust-lang.org/cargo/commands/cargo-check.html" rel="noopener noreferrer"&gt;check&lt;/a&gt; the local package and its dependencies for errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy directly to Cloud Run &lt;a href="https://cloud.google.com/run/docs/deploying-source-code?utm_campaign=CDR_default_0x80ca756c&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;from source&lt;/a&gt;, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy cloud-rust-example &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what each part of the command means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gcloud run deploy cloud-rust-example&lt;/code&gt;: This is the base command to deploy a service to Cloud Run. &lt;code&gt;cloud-rust-example&lt;/code&gt; is the name we’re giving to our service. You can choose a different name.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--source .&lt;/code&gt;: This flag tells Cloud Run where to find the source code for your application. The &lt;code&gt;.&lt;/code&gt; indicates the current directory. Cloud Run will use the Dockerfile in this directory to build a container image.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--region us-central1&lt;/code&gt;: This specifies the Google Cloud region where your service will be deployed. In this case, we’re using &lt;code&gt;us-central1&lt;/code&gt;. You can choose a region closer to your users for lower latency.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--allow-unauthenticated&lt;/code&gt;: This flag makes your deployed service publicly accessible without requiring authentication. This is convenient for initial testing and simple public services. &lt;strong&gt;For production applications, you should remove this flag and implement proper authentication and authorization.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud Run will automatically build and deploy your application. You will be provided with a service URL in the output. Accessing this URL in your browser will display the “Hello, World!” message.&lt;/p&gt;
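&lt;p&gt;If you missed the URL in the deployment output, you can look it up with &lt;code&gt;gcloud&lt;/code&gt; and test the route from the command line. This is a sketch that assumes the service name and region used above:&lt;/p&gt;

```shell
# Look up the URL of the deployed service (name/region from the deploy step)
URL=$(gcloud run services describe cloud-rust-example \
    --region us-central1 --format 'value(status.url)')

# Request the root route, which returns the "Hello, World!" page
curl "$URL"
```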

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F573%2F0%2AQJmbsamFXPgavNTB" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F573%2F0%2AQJmbsamFXPgavNTB" width="573" height="135"&gt;&lt;/a&gt;&lt;/p&gt;
Hello world output from / route



&lt;h3&gt;
  
  
  Integrating with Google Cloud Services
&lt;/h3&gt;

&lt;p&gt;Let’s now show how to integrate our application with Google Cloud services. I’ve selected a straightforward scenario that doesn’t require any project configuration to work. You’ll add a new application route &lt;code&gt;/project&lt;/code&gt; that will display information about your project.&lt;/p&gt;

&lt;p&gt;To implement this, you’ll use the &lt;a href="https://github.com/googleapis/google-cloud-rust" rel="noopener noreferrer"&gt;google-cloud-rust&lt;/a&gt; library to interact with the &lt;a href="https://cloud.google.com/resource-manager/docs?utm_campaign=CDR_0x2b6f3004_default_b403810548&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Resource Manager&lt;/a&gt; API and retrieve information about your Google Cloud project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The google-cloud-rust library is currently experimental. APIs may change, and it’s important to stay updated with the latest releases and documentation.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Add Dependencies&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;First, add the Resource Manager v3 client library and the &lt;a href="https://docs.rs/reqwest/latest/reqwest/" rel="noopener noreferrer"&gt;reqwest&lt;/a&gt; HTTP client to your Cargo.toml file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo add google-cloud-resourcemanager-v3 reqwest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Implement the handler&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;There are four key changes we’ll need to make in src/main.rs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add /project Route:&lt;/strong&gt; A new route &lt;code&gt;/project&lt;/code&gt; will display project information, implemented by &lt;code&gt;project_handler()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_handler&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;project_handler function:&lt;/strong&gt; The project handler calls &lt;a href="https://docs.rs/google-cloud-resourcemanager-v3/latest/google_cloud_resourcemanager_v3/client/struct.Projects.html#method.get_project" rel="noopener noreferrer"&gt;get_project()&lt;/a&gt; to fetch project details, then formats the project information into an HTML response. Error handling is included to display any errors that occur during the API call.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;project_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Extension&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Projects&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Html&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Project ID not initialized"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;project_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"projects/{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="nf"&gt;.get_project&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.send&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;project_number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="py"&gt;.name&lt;/span&gt;&lt;span class="nf"&gt;.strip_prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"projects/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Unknown"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="nf"&gt;Html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"&amp;lt;h1&amp;gt;Project Info&amp;lt;/h1&amp;gt;&amp;lt;ul&amp;gt;&amp;lt;li&amp;gt;Name: &amp;lt;code&amp;gt;{}&amp;lt;/code&amp;gt;&amp;lt;/li&amp;gt;&amp;lt;li&amp;gt;ID: &amp;lt;code&amp;gt;{}&amp;lt;/code&amp;gt;&amp;lt;/li&amp;gt;&amp;lt;li&amp;gt;Number: &amp;lt;code&amp;gt;{}&amp;lt;/code&amp;gt;&amp;lt;/li&amp;gt;&amp;lt;/ul&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="py"&gt;.display_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;project_number&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;Html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;h1&amp;gt;Error getting project info: {}&amp;lt;/h1&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Share client with handler:&lt;/strong&gt; For best performance, any one-time configuration should not reside in the handler. The &lt;a href="https://docs.rs/google-cloud-resourcemanager-v3/latest/google_cloud_resourcemanager_v3/client/struct.Projects.html#" rel="noopener noreferrer"&gt;Projects&lt;/a&gt; client can be initialized in main() and then shared with the handler with Axum’s &lt;a href="https://docs.rs/axum/latest/axum/struct.Extension.html" rel="noopener noreferrer"&gt;Extension&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add helper function for project metadata:&lt;/strong&gt; To find out the project ID the container is running in, you’ll need to access the &lt;a href="https://cloud.google.com/resource-manager/docs?utm_campaign=CDR_0x2b6f3004_default_b403810548&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;metadata key&lt;/a&gt;. That project ID will then be used to call the Resource Manager API to get more &lt;a href="https://docs.rs/google-cloud-resourcemanager-v3/latest/google_cloud_resourcemanager_v3/model/struct.Project.html" rel="noopener noreferrer"&gt;information about the project&lt;/a&gt;, including its display name and creation time. You can use &lt;a href="https://doc.rust-lang.org/std/sync/struct.OnceLock.html" rel="noopener noreferrer"&gt;OnceLock&lt;/a&gt; to initialize the project ID only once.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OnceLock&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;OnceLock&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_project_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to get project ID"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="nf"&gt;.set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to set PROJECT_ID"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_project_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;reqwest&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://metadata.google.internal/computeMetadata/v1/project/project-id"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;
        &lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Metadata-Flavor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Google"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.send&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="nf"&gt;.status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.is_success&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="nf"&gt;.text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Metadata server returned error: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="nf"&gt;.status&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error querying metadata server: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Set GOOGLE_CLOUD_PROJECT Environment Variable (Locally)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For local testing, you’ll need to set the &lt;code&gt;GOOGLE_CLOUD_PROJECT&lt;/code&gt; environment variable to your Google Cloud project ID. You can do this in your terminal before running the application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-project-id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;your-project-id&lt;/code&gt; with your actual project ID. When the application runs on Cloud Run, this variable isn’t needed: &lt;code&gt;get_project_id()&lt;/code&gt; falls back to querying the metadata server, which is available inside the container.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Enable the Resource Manager API&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If you haven’t already, make sure to enable the &lt;a href="https://console.cloud.google.com/apis/api/cloudresourcemanager.googleapis.com/overview?utm_campaign=CDR_default_0xd368824c&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Resource Manager API&lt;/a&gt; within your Google Cloud project.&lt;/p&gt;
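&lt;p&gt;You can enable the API from the console link above, or from the command line:&lt;/p&gt;

```shell
# Enable the Cloud Resource Manager API in the current project
gcloud services enable cloudresourcemanager.googleapis.com
```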

&lt;h4&gt;
  
  
  &lt;strong&gt;Provide Resource Manager IAM access&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;You will need to grant a role that includes the &lt;a href="https://cloud.google.com/resource-manager/docs/access-control-proj#permissions?utm_campaign=CDR_default_0xd368824c&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;resourcemanager.projects.get&lt;/a&gt; permission to the appropriate &lt;a href="https://cloud.google.com/run/docs/securing/service-identity#types-of-service-accounts?utm_campaign=CDR_0x2b6f3004_default_b403810548&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Cloud Run service account&lt;/a&gt;. The instructions here use the Compute Engine default service account. If you are running locally, you’ll also need to grant these permissions to your own account.&lt;/p&gt;
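&lt;p&gt;As a sketch, granting the broad &lt;code&gt;roles/viewer&lt;/code&gt; role (which includes the &lt;code&gt;resourcemanager.projects.get&lt;/code&gt; permission) to the Compute Engine default service account could look like this; for production, prefer a custom role limited to the permissions you actually need:&lt;/p&gt;

```shell
# Assumes the service runs as the Compute Engine default service account
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" \
    --format 'value(projectNumber)')

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member "serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role "roles/viewer"
```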

&lt;p&gt;&lt;strong&gt;Redeploy to Cloud Run&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the same &lt;code&gt;gcloud run deploy&lt;/code&gt; command as before to redeploy your updated application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy cloud-rust-example &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, when you visit the service URL provided by Cloud Run and navigate to the &lt;code&gt;/project&lt;/code&gt; path, you should see information about your Google Cloud project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F716%2F0%2AojC486ePfJcfZ30r" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F716%2F0%2AojC486ePfJcfZ30r" width="716" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
Project information output from /project route



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This guide demonstrates the process of deploying a Rust Axum application on Cloud Run. I started with a basic “Hello, World!” example from the Axum repository, explained its code, and then showed how to enhance it for Cloud Run compatibility by dynamically configuring the port and creating a Dockerfile. By combining Rust and Axum with Cloud Run’s serverless simplicity, you can efficiently build and deploy robust web services. The sample source code is available in the &lt;a href="https://github.com/kweinmeister/cloud-rust-example" rel="noopener noreferrer"&gt;cloud-rust-example&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;For more information about Cloud Run, I recommend the &lt;a href="https://cloud.google.com/run/docs/quickstarts/build-and-deploy/deploy-service-other-languages?utm_campaign=CDR_default_0x80ca756c&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;quickstart&lt;/a&gt; for building and deploying a web application in the documentation. Also, check out &lt;a href="https://www.youtube.com/watch?v=rOMroL3mhO4" rel="noopener noreferrer"&gt;this video&lt;/a&gt; for a walkthrough of running Rust on Cloud Run. Feel free to connect on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; to continue the discussion!&lt;/p&gt;




</description>
      <category>dockerfiles</category>
      <category>web</category>
      <category>axum</category>
      <category>rust</category>
    </item>
    <item>
      <title>AI Appraiser: Discover the value of your items with Gemini on Google Cloud</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Sun, 09 Mar 2025 04:01:51 +0000</pubDate>
      <link>https://dev.to/googlecloud/ai-appraiser-discover-the-value-of-your-items-with-gemini-on-google-cloud-n4p</link>
      <guid>https://dev.to/googlecloud/ai-appraiser-discover-the-value-of-your-items-with-gemini-on-google-cloud-n4p</guid>
      <description>&lt;p&gt;While you were out shopping or cleaning up around the house, have you ever wondered what an item is worth? Estimating the value of items can be tricky, often requiring expert knowledge or time-consuming research. What if you could get a quick, AI-powered appraisal with just a picture?&lt;/p&gt;

&lt;p&gt;That’s the idea behind &lt;a href="https://github.com/kweinmeister/ai-appraiser" rel="noopener noreferrer"&gt;AI Appraiser&lt;/a&gt;, a small project I recently built using the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Google Gen AI SDK&lt;/a&gt;. Whether you’re assessing the value of a beloved collectible, figuring out a fair price for secondhand goods, or simply curious about the worth of everyday objects, AI Appraiser offers a user-friendly way to get AI-powered insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bgtwue1gln2ohlz0wmm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bgtwue1gln2ohlz0wmm.gif" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;
Appraising a Google Chromecast remote with AI Appraiser



&lt;h3&gt;
  
  
  From curiosity to code
&lt;/h3&gt;

&lt;p&gt;The project started with a question: could I leverage the power of Gemini’s multimodal capabilities and its integrated search to build a practical tool?&lt;/p&gt;

&lt;p&gt;I was particularly interested in exploring Gemini’s &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview#ground-public?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;&lt;strong&gt;grounding with Google Search&lt;/strong&gt;&lt;/a&gt; capabilities. This feature seemed perfect for fetching up-to-date pricing information directly from the web, which is crucial for accurate item valuation. By grounding the AI’s analysis in real-time search results, I aimed to build an app that provides more reliable and informed estimates.&lt;/p&gt;

&lt;p&gt;To bring AI Appraiser from a concept to reality, I focused on the user experience. Putting myself in the shoes of someone wanting to quickly check the value of an item, I thought, “Okay, what’s the &lt;em&gt;first&lt;/em&gt; thing I’d naturally do?” The answer was: &lt;em&gt;show&lt;/em&gt; the app the item. That’s why I started by focusing on the image upload feature, really trying to make it feel as smooth and effortless as possible. Drag and drop, click to upload — I wanted it to feel completely natural, like you were just showing the app what you had.&lt;/p&gt;

&lt;p&gt;But even with a smooth image uploader, I knew that relying on pictures alone would be limiting, as photos don’t always tell the whole story. Think about those subtle details that can swing an item’s value wildly — the pristine condition of a vintage collectible, the telltale model number on a piece of tech, or those unique markings that authenticate a piece of art. To bridge this gap, I knew I needed to give users a way to add their own insights. That’s why I made sure to include an optional text description box.&lt;/p&gt;

&lt;p&gt;Integrating with the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Google Gen AI SDK&lt;/a&gt; is the heart of the application. All of the key pieces are documented in the &lt;a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/getting-started/intro_gemini_2_0_flash.ipynb" rel="noopener noreferrer"&gt;Getting Started notebook&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passing in multimodal image data (the image and optional description)&lt;/li&gt;
&lt;li&gt;Enabling Grounding with Search to improve the valuation quality&lt;/li&gt;
&lt;li&gt;Enabling &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;controlled generation&lt;/a&gt; to ensure the response format is consistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Gemini’s grounded search, you can think of the application as a research assistant that can instantly scour countless sources across the web, pulling in pricing data, market trends, and comparable listings.&lt;/p&gt;

&lt;p&gt;Beyond that, I didn’t want AI Appraiser to just spit out a valuation and leave you guessing. I wanted it to &lt;em&gt;show&lt;/em&gt; you Gemini’s reasoning, laying out the key factors it considered and providing direct links to the sources it used. This way, you’re not just getting a number; you’re getting a glimpse into the AI’s thought process, and you can judge for yourself whether the estimate makes sense.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture: Keeping it Simple and Serverless
&lt;/h3&gt;

&lt;p&gt;For this project, I wanted to keep the architecture straightforward and leverage serverless technologies for ease of deployment and scalability. Here’s the basic setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Built with HTML, &lt;a href="https://tailwindcss.com/" rel="noopener noreferrer"&gt;Tailwind CSS&lt;/a&gt;, and &lt;a href="https://htmx.org/" rel="noopener noreferrer"&gt;HTMX&lt;/a&gt;. HTMX is a great little library that allows for dynamic UI updates without writing complex JavaScript, perfect for a project like this. Tailwind CSS helped with rapid styling, and plain HTML provided the structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; A &lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt; application in Python. FastAPI is excellent for building APIs quickly and efficiently. It handles image uploads, interacts with &lt;a href="https://cloud.google.com/storage?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Google Cloud Storage&lt;/a&gt; (optionally), and calls the Gemini API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini API:&lt;/strong&gt; The star of the show! &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Gemini&lt;/a&gt; handles the heavy lifting of image analysis, search, and valuation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a simplified diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2phr0muh6oywdk9gugp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2phr0muh6oywdk9gugp5.png" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;
Architecture diagram of AI Appraiser on Google Cloud



&lt;h3&gt;
  
  
  Key features in action
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Image Upload and Preview&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The frontend provides a drag-and-drop interface for image uploads. It uses the &lt;a href="https://flowbite.com/docs/forms/file-input/#dropzone" rel="noopener noreferrer"&gt;Dropzone&lt;/a&gt; file input component from the &lt;a href="https://flowbite.com/" rel="noopener noreferrer"&gt;Flowbite CSS library&lt;/a&gt;. HTMX handles the asynchronous upload to the backend, and the image is immediately displayed for preview.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt;
   &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"image_file"&lt;/span&gt;
   &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"image_file"&lt;/span&gt;
   &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"file"&lt;/span&gt;
   &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"hidden"&lt;/span&gt;
   &lt;span class="na"&gt;hx-post=&lt;/span&gt;&lt;span class="s"&gt;"/upload-image"&lt;/span&gt;
   &lt;span class="na"&gt;hx-target=&lt;/span&gt;&lt;span class="s"&gt;"#image-preview"&lt;/span&gt;
   &lt;span class="na"&gt;hx-encoding=&lt;/span&gt;&lt;span class="s"&gt;"multipart/form-data"&lt;/span&gt;
&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;AI-Powered valuation with Gemini&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The backend uses the Gemini API, specifically leveraging its &lt;strong&gt;grounding with Google Search&lt;/strong&gt; capability. As highlighted in the &lt;a href="https://ai.google.dev/gemini-api/docs/grounding?lang=python" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, this feature allows Gemini to augment its responses with real-time information from Google Search. The prompt instructs Gemini to act as a professional appraiser, using search to find comparable items and provide a reasoned valuation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;valuation_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a professional appraiser, adept at determining the value of items based on their description and market data.
Here is additional information provided by the user: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.
Your task is to estimate the item&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s fair market value
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="n"&gt;continues&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
google_search_tool = Tool(google_search=GoogleSearch())
config_with_search = GenerateContentConfig(tools=[google_search_tool])

response_with_search = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        Part.from_uri(file_uri=image_uri, mime_type=guess_type(image_uri)[0]),
        valuation_prompt,
    ],
    config=config_with_search,
)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Structured output and display&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The Gemini API response is then parsed into a consistent data structure, so the estimated value, reasoning, and source URLs always arrive in a predictable shape, enforced by &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;controlled generation&lt;/a&gt;. HTMX then handles updating the results section dynamically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ValuationResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;estimated_value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;search_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ...
&lt;/span&gt;
&lt;span class="n"&gt;config_for_parsing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response_mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ValuationResponse&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
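
&lt;p&gt;Once the schema is enforced, consuming the response is simple. Here’s an illustrative sketch using only the standard library; the JSON payload below is a hypothetical example in the shape the schema guarantees, not a real API response:&lt;/p&gt;

```python
import json

# A hypothetical payload in the shape enforced by response_schema.
# With controlled generation, response.text is guaranteed to parse like this.
response_text = """{
  "estimated_value": 125.0,
  "currency": "USD",
  "reasoning": "Comparable listings for this model range from $100 to $150.",
  "search_urls": ["https://example.com/listing-1"]
}"""

valuation = json.loads(response_text)

# Format the headline figure for display in the results section.
line = f"{valuation['estimated_value']:.2f} {valuation['currency']}"
print(line)
```

&lt;p&gt;In the actual app, the Pydantic model shown above does this validation step, raising an error if the response ever drifts from the schema.&lt;/p&gt;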



&lt;h3&gt;
  
  
  Deployment to Cloud Run
&lt;/h3&gt;

&lt;p&gt;Deploying AI Appraiser is very straightforward with &lt;a href="https://cloud.google.com/run?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;. The &lt;a href="https://github.com/kweinmeister/ai-appraiser/blob/main/Dockerfile" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt; is included in the repository, and deployment is as simple as running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy ai-appraiser &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$LOCATION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GOOGLE_CLOUD_PROJECT=&lt;/span&gt;&lt;span class="nv"&gt;$GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="s2"&gt;,STORAGE_BUCKET=&lt;/span&gt;&lt;span class="nv"&gt;$STORAGE_BUCKET&lt;/span&gt;&lt;span class="s2"&gt;,MODEL_ID=&lt;/span&gt;&lt;span class="nv"&gt;$MODEL_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloud Run handles containerization, deployment, scaling, and infrastructure management, allowing you to focus on the application code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try it out and contribute!
&lt;/h3&gt;

&lt;p&gt;AI Appraiser is a fun little project that demonstrates the power of combining Gemini with Google Cloud. It’s not meant for professional appraisals, but it can be a handy tool for getting a quick estimate or just satisfying your curiosity.&lt;/p&gt;

&lt;p&gt;The code is available on GitHub: &lt;a href="https://github.com/kweinmeister/ai-appraiser" rel="noopener noreferrer"&gt;https://github.com/kweinmeister/ai-appraiser&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to clone the repository, deploy it to your own Google Cloud project, and experiment with it. Contributions and feedback are welcome! What other creative applications can we build with Gemini and Google Cloud? Let’s continue the conversation on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://twitter.com/kweinmeister" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>googlecloudplatform</category>
      <category>gemini</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Attention Evolved: How Multi-Head Latent Attention Works</title>
      <dc:creator>Karl Weinmeister</dc:creator>
      <pubDate>Sun, 02 Mar 2025 08:48:32 +0000</pubDate>
      <link>https://dev.to/googlecloud/attention-evolved-how-multi-head-latent-attention-works-458g</link>
      <guid>https://dev.to/googlecloud/attention-evolved-how-multi-head-latent-attention-works-458g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hyil4vn4elztnvfha6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hyil4vn4elztnvfha6b.png" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;
Compressing keys and values to reduce the cache size is MLA’s key innovation



&lt;p&gt;Attention is the fundamental mechanism that enables LLMs to capture long-range dependencies and context. It was introduced as part of the Transformer architecture in 2017 in &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention is All You Need&lt;/a&gt;. As models have grown in size, new variants have emerged to address the computational cost of standard Multi-Head attention.&lt;/p&gt;

&lt;p&gt;Multi-Head Latent Attention (MLA), introduced in the &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;Deepseek-V2 paper&lt;/a&gt;, represents a novel approach to efficient attention. MLA introduces a low-rank compression technique that reduces the memory footprint without sacrificing model performance. MLA is used by the &lt;a href="https://console.cloud.google.com/vertex-ai/publishers/deepseek-ai/model-garden/deepseek-v3?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;DeepSeek-V3&lt;/a&gt; and &lt;a href="https://console.cloud.google.com/vertex-ai/publishers/deepseek-ai/model-garden/deepseek-r1?utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;DeepSeek-R1&lt;/a&gt; models available in the &lt;a href="https://console.cloud.google.com/vertex-ai/model-garden??utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Vertex Model Garden&lt;/a&gt; and deployable to &lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-multihost-gpu??utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Google Kubernetes Engine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll start with the standard attention mechanism and build up to what makes Multi-Head Latent Attention special.&lt;/p&gt;

&lt;h3&gt;
  
  
  Attention
&lt;/h3&gt;

&lt;p&gt;To illustrate how single-head attention works, consider this example sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The animal didn’t cross the street, because &lt;strong&gt;it&lt;/strong&gt; was too tired.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How can we understand what “it” means? We need to look at surrounding words, or more precisely, tokens. Attention allows us to analyze this context mathematically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F323%2F0%2Arc5vzyA4vvVNgWI3" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F323%2F0%2Arc5vzyA4vvVNgWI3" width="323" height="267"&gt;&lt;/a&gt;&lt;/p&gt;
Self-attention for pronoun “it” (source)



&lt;p&gt;There are three components that make attention work: queries, keys, and values. We compare the &lt;strong&gt;Query&lt;/strong&gt; word (“it”) to each &lt;strong&gt;Key&lt;/strong&gt; in the sentence using the dot product. The dot product measures how “similar” two vectors are. A higher dot product means more “attention” or relevance. This is reflected in the QKᵀ term of the attention calculation. A scaling factor, based on the dimension of the keys, is applied to keep the dot products from becoming too large.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F808%2F0%2AKG2B2oNSX0aCWvDG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F808%2F0%2AKG2B2oNSX0aCWvDG" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We then use these scores to weigh the &lt;strong&gt;Value&lt;/strong&gt; vectors, which hold the actual information associated with each key. Words that are more “relevant” get more weight in understanding our Query word. This weighted sum of these Value vectors becomes the Attention Output — a context-aware representation of our word “it”.&lt;/p&gt;
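
&lt;p&gt;The whole calculation fits in a few lines of plain Python. This is a toy illustration with made-up 2-d vectors, not production attention code:&lt;/p&gt;

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # QK^T: dot product of the query with each key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of the Value vectors: the context-aware output.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the second key, so the output is pulled
# toward the second value vector.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [1.0, 0.0]]
values = [[0.0, 10.0], [10.0, 0.0]]
out = attention(q, keys, values)
```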

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AaxE7QNM6qJ9H01d8" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AaxE7QNM6qJ9H01d8" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;
Attention mechanism overview (source)



&lt;h3&gt;
  
  
  Positional Encodings
&lt;/h3&gt;

&lt;p&gt;Before we move on, there’s one more crucial piece to the puzzle: &lt;strong&gt;positional encodings&lt;/strong&gt;. So far, we haven’t considered the order of words in the sentence. Without this information, “Cross the street” and “Street the cross” would be treated identically because attention, by itself, is order-agnostic.&lt;/p&gt;

&lt;p&gt;To address this, we add positional encodings. These are special vectors that tell the attention mechanism the position of each word in the sequence. In &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention is All You Need&lt;/a&gt;, sinusoidal positional encodings were used. They create a unique “position signature” for each word with sine and cosine functions of different frequencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1RmryiLYLmAJrxv5" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1RmryiLYLmAJrxv5" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
Visualization of positional encodings (source)



&lt;p&gt;These positional encodings are added to the input word representations before we calculate Queries, Keys, and Values. Altogether, the attention mechanism is now position-aware.&lt;/p&gt;
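
&lt;p&gt;A minimal sketch of the sinusoidal scheme, with a toy model dimension for illustration:&lt;/p&gt;

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sin(pos / 10000^(2i/d)) and
    cos(pos / 10000^(2i/d)) interleaved across the dimensions."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(pos * freq))  # even dimension
        pe.append(math.cos(pos * freq))  # odd dimension
    return pe

# Each position gets a unique signature; position 0 is [0, 1, 0, 1, ...].
pe0 = positional_encoding(0, 8)
pe3 = positional_encoding(3, 8)
```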

&lt;p&gt;While much of the original Transformers architecture has stood the test of time, a new approach to positional encodings is now commonly used. Rotary Position Embeddings, or RoPE, was introduced in 2021 in the &lt;a href="https://arxiv.org/abs/2104.09864v5" rel="noopener noreferrer"&gt;RoFormer&lt;/a&gt; architecture. Rather than adding a position term, RoPE rotates the query and key vectors based on their relative positions. This rotation allows the model to understand the relationship between words based on their distance from each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5vqmik7827zu6yro69d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5vqmik7827zu6yro69d.png" width="628" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
Illustration of position embedded into a vector through rotation



&lt;p&gt;For a more detailed understanding of positional encoding techniques, I recommend the &lt;a href="https://huggingface.co/blog/designing-positional-encoding" rel="noopener noreferrer"&gt;Designing Positional Encoding&lt;/a&gt; blog post.&lt;/p&gt;
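
&lt;p&gt;RoPE’s key property, that attention scores depend only on the relative distance between positions, can be checked with a toy 2-d rotation. This is an illustrative sketch with an arbitrary angle, not a full RoPE implementation:&lt;/p&gt;

```python
import math

def rope_2d(vec, pos, theta=0.5):
    """Rotate a 2-d vector by an angle proportional to its position."""
    angle = pos * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)

# Both pairs are two positions apart, so their scores are identical:
# the absolute positions cancel, leaving only the relative offset.
score_a = dot(rope_2d(q, 5), rope_2d(k, 3))
score_b = dot(rope_2d(q, 9), rope_2d(k, 7))
```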

&lt;h3&gt;
  
  
  Multi-Head Attention
&lt;/h3&gt;

&lt;p&gt;So far, we’ve described the attention mechanism using one set of Query, Key, and Value projections. To capture more nuanced relationships, we can use &lt;strong&gt;Multi-Head Attention&lt;/strong&gt; or &lt;strong&gt;MHA&lt;/strong&gt;. MHA uses multiple sets of QKV projections — “Head 1”, “Head 2”, “Head 3”, and so on. Each head learns to focus on different aspects of the relationship between words. For example, one head might focus on grammatical relationships and another on semantic relationships like synonymy or antonymy. Each head calculates its own attention output, and MHA concatenates these outputs and projects the resulting vector to obtain the final output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F656%2F0%2AkT12xBmUGd22izKc" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F656%2F0%2AkT12xBmUGd22izKc" width="656" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
Multi-Head attention calculation (source)



&lt;p&gt;As Key and Value tensors for earlier tokens in a sequence remain the same, they can be cached to avoid unnecessary computations. This &lt;a href="https://huggingface.co/blog/not-lain/kv-caching" rel="noopener noreferrer"&gt;key-value (KV) cache&lt;/a&gt; can become a memory bottleneck, slowing down inference, especially for longer texts.&lt;/p&gt;
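
&lt;p&gt;A back-of-the-envelope calculation shows how quickly the cache grows. The numbers below are purely illustrative, not tied to any particular model:&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size: one Key and one Value vector
    (factor of 2) per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative configuration: 32 layers, 32 KV heads, head dimension 128,
# a 4096-token context, fp16 (2-byte) elements.
size = kv_cache_bytes(32, 32, 128, 4096)
print(f"{size / 2**30:.1f} GiB")  # → 2.0 GiB
```

&lt;p&gt;Reducing the number of Key and Value heads shrinks this figure proportionally, which is exactly the lever the attention variants below pull.&lt;/p&gt;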

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin6jarb1155oih0fk63c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin6jarb1155oih0fk63c.png" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;
Illustration of how the KV Cache grows with each token in the sequence



&lt;h3&gt;
  
  
  Multi-Query and Grouped-Query Attention
&lt;/h3&gt;

&lt;p&gt;Fortunately, new techniques have emerged to address this issue, by reducing the number of keys and values.&lt;/p&gt;

&lt;p&gt;Let’s look at Multi-Query Attention, or MQA, first. Instead of each Query head having its own Key and Value set, all Query heads share a single Key and Value set. Because the KV cache size scales with the number of Key and Value heads, MQA can significantly reduce the cache size.&lt;/p&gt;

&lt;p&gt;However, there’s a trade-off. By sharing a single Key and Value set, model performance can suffer. Grouped-Query Attention, or GQA, is a middle ground. Instead of one shared Key and Value in MQA, it uses a small number of shared Key and Value sets, called “groups”.&lt;/p&gt;
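
&lt;p&gt;The difference between MHA, MQA, and GQA boils down to how many Query heads share each Key/Value set. A minimal sketch of the head-to-group mapping, with hypothetical head counts:&lt;/p&gt;

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Map a Query head to the KV head (group) it reads from.
    n_kv_heads == n_q_heads is standard MHA, n_kv_heads == 1 is MQA,
    and anything in between is GQA."""
    assert n_q_heads % n_kv_heads == 0
    return q_head // (n_q_heads // n_kv_heads)

# GQA with 8 Query heads and 2 KV groups:
# heads 0-3 share group 0, heads 4-7 share group 1.
mapping = [kv_head_for(h, 8, 2) for h in range(8)]
```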

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/o68RRGxAtDo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Introducing MLA
&lt;/h3&gt;

&lt;p&gt;Ideally, we want to shrink the KV Cache without sacrificing performance. And that’s where MLA, or Multi-Head Latent Attention, comes in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ab44FSXeyIMW5sWG_" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ab44FSXeyIMW5sWG_" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;
KV Cache Comparison (source)



&lt;p&gt;MLA tackles this KV Cache problem with compression. MLA compresses, or down-projects, keys and values into a smaller &lt;a href="https://en.wikipedia.org/wiki/Low-rank_approximation" rel="noopener noreferrer"&gt;low-rank&lt;/a&gt; matrix. This compressed matrix is then up-projected during the attention calculation. We can see how a compressed latent matrix in lower-dimensional space is derived with a down-projection matrix Wᴰᴷⱽ. Keys and values can be up-projected with Wᵁᴷ and Wᵁⱽ, respectively.&lt;/p&gt;
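
&lt;p&gt;A toy sketch of the down-project/up-project round trip, with random matrices and made-up dimensions for illustration (the real projections are learned during training):&lt;/p&gt;

```python
import random

def matmul(A, B):
    """Naive matrix multiply, for illustration only."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

d_model, d_latent, seq_len = 64, 8, 16
random.seed(0)

hidden = rand_matrix(seq_len, d_model)   # token representations
W_dkv = rand_matrix(d_model, d_latent)   # down-projection (W^DKV)
W_uk = rand_matrix(d_latent, d_model)    # up-projection for keys (W^UK)

# Only the small latent matrix is cached...
latent = matmul(hidden, W_dkv)           # seq_len x d_latent
# ...and keys are reconstructed on the fly during attention.
keys = matmul(latent, W_uk)              # seq_len x d_model

cached = seq_len * d_latent              # elements held in the KV cache
full = seq_len * d_model                 # elements full keys would need
```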

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F316%2F0%2A1M1xPtuVTtJOV7Ba" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F316%2F0%2A1M1xPtuVTtJOV7Ba" width="316" height="205"&gt;&lt;/a&gt;&lt;/p&gt;
Calculation of compressed latent matrix, keys, and values



&lt;p&gt;MLA requires a modified approach to address token positions called decoupled RoPE. Standard RoPE directly modifies compressed keys and values with position information. MLA’s compression technique complicates this and would hinder inference efficiency. Instead, with decoupled RoPE, the relative position information is encoded in separate vectors that are then concatenated with the compressed keys and values before the up-projection. This allows for efficient application of positional information without interfering with the compression/decompression process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8o57yi61gg3afobggii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8o57yi61gg3afobggii.png" width="286" height="188"&gt;&lt;/a&gt;&lt;/p&gt;
Query (q) and Key (k) vectors are formed by concatenating their compressed (C) and positional (R) parts



&lt;p&gt;In the end, Multi-Head Latent Attention provides faster inference and smaller memory usage. According to the &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;DeepSeek-V2 paper&lt;/a&gt;, this approach maintains or even improves performance compared to Multi-Head attention. Compressing the keys and values to a low-rank representation apparently doesn’t lose too much information, and may even aid generalization.&lt;/p&gt;

&lt;p&gt;To try out MLA in action, you can deploy DeepSeek-R1 or DeepSeek-V3 671B on Google Kubernetes Engine (GKE) using graphics processing units (GPUs) across multiple nodes with &lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-multihost-gpu??utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;this guide&lt;/a&gt;. You can also deploy DeepSeek from the &lt;a href="https://console.cloud.google.com/vertex-ai/model-garden??utm_source=ext&amp;amp;utm_medium=partner&amp;amp;utm_campaign=CDR_kwe_gcp_content_020225&amp;amp;utm_content=-" rel="noopener noreferrer"&gt;Vertex Model Garden&lt;/a&gt;. Feel free to connect on &lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/kweinmeister?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; to continue the discussion!&lt;/p&gt;




</description>
      <category>machinelearning</category>
      <category>googlecloudplatform</category>
      <category>deepseek</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
