<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Google Developer Experts</title>
    <description>The latest articles on DEV Community by Google Developer Experts (@gde).</description>
    <link>https://dev.to/gde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11939%2Fe3080d5b-ecde-42a8-b089-bafecc31fa97.png</url>
      <title>DEV Community: Google Developer Experts</title>
      <link>https://dev.to/gde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gde"/>
    <language>en</language>
    <item>
      <title>[Gemini][Agent] Google Managed Agents API</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Wed, 03 Jun 2026 01:01:36 +0000</pubDate>
      <link>https://dev.to/gde/geminiagent-google-managed-agents-api-4e43</link>
      <guid>https://dev.to/gde/geminiagent-google-managed-agents-api-4e43</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9nwsti79ib9ae970q7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9nwsti79ib9ae970q7q.png" alt="image-20260602220526732" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image Source: &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/managed-agents" rel="noopener noreferrer"&gt;Google Cloud Docs - Managed Agents on Agent Platform&lt;/a&gt;)&lt;/p&gt;

&lt;h1&gt;
  
  
  Preamble: The era of hand-rolling your own agent loop is coming to an end
&lt;/h1&gt;

&lt;p&gt;In the past, if you wanted to build an AI agent that could truly "&lt;strong&gt;do things&lt;/strong&gt;", the component list that came to mind probably looked something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An LLM main loop (ReAct? Write your own state machine?)&lt;/li&gt;
&lt;li&gt;A sandbox to run LLM-generated code (Docker? Firecracker? E2B?)&lt;/li&gt;
&lt;li&gt;A filesystem to store intermediate files produced by the agent (S3? Local? Temporary or persistent?)&lt;/li&gt;
&lt;li&gt;A search API (Connect to Google Custom Search yourself? SerpAPI?)&lt;/li&gt;
&lt;li&gt;A page fetcher (playwright? readability-lxml?)&lt;/li&gt;
&lt;li&gt;A tool router to connect all of the above&lt;/li&gt;
&lt;li&gt;And only after all that, a way to let the user resume the session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once the session broke, everything was gone: the &lt;code&gt;report.md&lt;/code&gt; and &lt;code&gt;sources.json&lt;/code&gt; the agent was halfway through writing, and the venv it was halfway through using. Nobody wants to go back to "I'll spin up a Docker container for you, mount a volume, and remember to delete it in 7 days."&lt;/p&gt;

&lt;p&gt;In the past few days, Google's Cloud Docs have turned this whole pipeline into "&lt;strong&gt;calling a managed API&lt;/strong&gt;": &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/managed-agents" rel="noopener noreferrer"&gt;Gemini Enterprise Agent Platform&lt;/a&gt; launched the &lt;strong&gt;Managed Agents API&lt;/strong&gt; (internal codename Antigravity), which manages the sandbox, filesystem, and toolset for you. Just pass an environment ID, and the intermediate files from the agent's previous run will still be waiting for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx0nba37wpvwlppthq9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx0nba37wpvwlppthq9e.png" alt="image-20260602220556522" width="645" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article will do two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Break down the core capabilities clearly, including what the underlying &lt;code&gt;antigravity-preview-05-2026&lt;/code&gt; model is doing.&lt;/li&gt;
&lt;li&gt; Use an &lt;strong&gt;open-source&lt;/strong&gt; LINE Research Planner Bot (&lt;a href="https://github.com/kkdai/line-research-bot" rel="noopener noreferrer"&gt;&lt;code&gt;kkdai/line-research-bot&lt;/code&gt;&lt;/a&gt;) as a live demonstration of how the new features combine in real production code, and share the &lt;strong&gt;five&lt;/strong&gt; typical Pre-GA pitfalls I hit while debugging so you can avoid them.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Three Core Capabilities
&lt;/h2&gt;

&lt;p&gt;According to the official documentation, the core of Managed Agents revolves around three things:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Persistent Sandbox + Filesystem
&lt;/h3&gt;

&lt;p&gt;In the past, code-interpreter-style features restarted a container on every call, losing everything from the previous one: packages installed with &lt;code&gt;pip install&lt;/code&gt;, files written to disk, and any Python interpreter state still in flight.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Each agent operates within a sandboxed environment … capable of reasoning, planning, executing code, web searching, and file operations.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, if you make a second interaction &lt;strong&gt;with the same &lt;code&gt;environment_id&lt;/code&gt;&lt;/strong&gt;, the agent will see the &lt;code&gt;/workspace/&lt;/code&gt; from the previous session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/workspace/sources.json&lt;/code&gt; is still there&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/workspace/report.md&lt;/code&gt; was left half-written, and this time the agent continues editing it&lt;/li&gt;
&lt;li&gt;Packages like &lt;code&gt;markdown&lt;/code&gt; installed with &lt;code&gt;pip install&lt;/code&gt; last time don't need to be reinstalled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For us product builders, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No need to maintain your own sandbox infrastructure&lt;/strong&gt; (Firecracker, microVM, expiration cleanup).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents can truly "complete a big task in multiple turns"&lt;/strong&gt;, instead of starting over each turn.&lt;/li&gt;
&lt;li&gt;The sandbox has a TTL of &lt;strong&gt;7 days&lt;/strong&gt;, and any interaction resets it, so it stays alive as long as the user comes back at least once a week.&lt;/li&gt;
&lt;/ul&gt;
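
&lt;p&gt;To make the TTL rule concrete, here is a minimal sketch of how a backend might estimate whether a sandbox is still alive. It assumes you record the time of each interaction yourself. The names &lt;code&gt;SANDBOX_TTL&lt;/code&gt;, &lt;code&gt;expires_at&lt;/code&gt;, and &lt;code&gt;is_probably_alive&lt;/code&gt; are hypothetical and not part of the API; the service remains the source of truth for actual expiry.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# 7-day TTL as described above; any interaction resets the clock.
SANDBOX_TTL = timedelta(days=7)


def expires_at(last_interaction: datetime) -> datetime:
    """Return the time the sandbox would expire if nobody touches it again."""
    return last_interaction + SANDBOX_TTL


def is_probably_alive(last_interaction: datetime, now: datetime) -> bool:
    """Local estimate only: True while the TTL window has not elapsed."""
    return expires_at(last_interaction) > now


last_seen = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
print(is_probably_alive(last_seen, last_seen + timedelta(days=6)))  # True
print(is_probably_alive(last_seen, last_seen + timedelta(days=8)))  # False
```

&lt;p&gt;If the estimate says the sandbox has likely expired, the safe fallback is to start a fresh environment rather than reuse the stored ID.&lt;/p&gt;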

&lt;p&gt;My LINE Bot relies on this for "&lt;strong&gt;progressive deepening&lt;/strong&gt;": the user first says "research X" and the agent writes sources and a report in the sandbox; a few minutes later, the user says "Chapter 2, go deeper" and the agent reads the original file back, revises Chapter 2, and saves it, all within the &lt;strong&gt;same sandbox and the same markdown file&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Built-in Tools
&lt;/h3&gt;

&lt;p&gt;When building an agent, you just list the tools you want, &lt;strong&gt;without having to connect to APIs yourself&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;# Python / bash / persistent venv
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;# Read/write /workspace
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;# Real Google Search, not Custom Search
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;# Feed URL to automatically fetch content + extract
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp_server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Any plug-in MCP server
&lt;/span&gt;     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grep-search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://mcp.grep.app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Several key observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;google_search&lt;/code&gt; is real Google&lt;/strong&gt;, not the Custom Search variant that requires you to set up a search engine ID and an API key. The return format includes search suggestions and can be used for grounding.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;url_context&lt;/code&gt; is equivalent to free readability + content extraction&lt;/strong&gt;: feed it a URL and get the main text back. No need to maintain another Playwright fleet.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Native MCP support&lt;/strong&gt;: You can directly integrate any &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; server. The entire ecosystem is open.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Multi-turn Session Chaining
&lt;/h3&gt;

&lt;p&gt;Each interaction returns an &lt;code&gt;id&lt;/code&gt;. When calling the next turn, pass it as &lt;code&gt;previous_interaction_id&lt;/code&gt;, and the agent will see the &lt;strong&gt;entire conversation history + sandbox state&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLAN ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;# Open a new sandbox
&lt;/span&gt;    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# … poll until completed …
&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SEARCH_COMPARE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# No need to restate context
&lt;/span&gt;    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Reuse sandbox
&lt;/span&gt;    &lt;span class="n"&gt;previous_interaction_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Connect history
&lt;/span&gt;    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this design, your backend is "&lt;strong&gt;only responsible for deciding what prompt to send each turn&lt;/strong&gt;". Session state, conversation history, and the filesystem are all managed server-side.&lt;/p&gt;
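
&lt;p&gt;In practice, the only per-user state the backend has to keep is two IDs. The sketch below is one way to hold them and build the arguments for the next call. The &lt;code&gt;Session&lt;/code&gt; class and both helper functions are hypothetical; the keyword names (&lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;environment&lt;/code&gt;, &lt;code&gt;previous_interaction_id&lt;/code&gt;, &lt;code&gt;background&lt;/code&gt;) are taken from the snippet above, and the IDs are placeholders.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Session:
    """The only per-user state the backend keeps (hypothetical shape)."""
    environment_id: Optional[str] = None
    last_interaction_id: Optional[str] = None


def next_call_kwargs(session: Session, agent: str, prompt: str) -> Dict[str, Any]:
    """Build keyword arguments for the next interactions.create call."""
    kwargs: Dict[str, Any] = {"agent": agent, "input": prompt, "background": True}
    if session.environment_id is None:
        # First turn: ask for a fresh sandbox.
        kwargs["environment"] = {"type": "remote"}
    else:
        # Later turns: reuse the sandbox and chain the history.
        kwargs["environment"] = session.environment_id
        kwargs["previous_interaction_id"] = session.last_interaction_id
    return kwargs


def record_result(session: Session, interaction_id: str, environment_id: str) -> None:
    """Remember the two IDs returned by the service after a turn."""
    session.last_interaction_id = interaction_id
    session.environment_id = environment_id


session = Session()
first = next_call_kwargs(session, "research-planner", "PLAN ...")
record_result(session, "interaction-1", "env-1")  # placeholder IDs
second = next_call_kwargs(session, "research-planner", "SEARCH_COMPARE")
print(first["environment"], second["environment"], second["previous_interaction_id"])
```

&lt;p&gt;Persisting a record like this per user (for example in a database document) is enough to resume a conversation from any stateless worker.&lt;/p&gt;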




&lt;h2&gt;
  
  
  Two APIs: Agents for Control Plane, Interactions for Data Plane
&lt;/h2&gt;

&lt;p&gt;The documentation splits the surface into two APIs with clearly separated responsibilities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agents API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/projects/.../agents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Create, update, delete agent settings (base_agent, tools, system_instruction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interactions API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/projects/.../interactions:create&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interact with deployed agents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Simply put: &lt;strong&gt;Agents = Configuration&lt;/strong&gt;, &lt;strong&gt;Interactions = Execution&lt;/strong&gt;. Creating an agent is a one-time task; running interactions is done every time a user message comes in. My LINE Bot only used the Agents API once during deployment to create the agent, and after that, Cloud Run only calls the Interactions API.&lt;/p&gt;

&lt;p&gt;The underlying base model is hardcoded as &lt;code&gt;antigravity-preview-05-2026&lt;/code&gt;, which is an agent-optimized version of the Gemini series (only this one is available during the Pre-GA preview period).&lt;/p&gt;




&lt;h2&gt;
  
  
  What Developers Truly Care About: Pricing and Integration Cost
&lt;/h2&gt;

&lt;p&gt;This API is still in Pre-GA, and the official documentation emphasizes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Antigravity is offered as Pre-General Availability software, which means it is not subject to any SLA or deprecation policy. Antigravity is not intended for production use or for use with sensitive data.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In plain language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Not for production use or sensitive data&lt;/strong&gt; (for compliance scenarios, please wait for GA).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;No SLA&lt;/strong&gt;, and the API shape might change at any time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Might be discontinued someday&lt;/strong&gt;, so don't bet your company's life on it.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Billing is at standard Vertex AI rates&lt;/strong&gt;, with no additional sandbox runtime fees, which is very friendly for demos, internal tools, and hackathons.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a good entry point for personal side projects and POCs: you &lt;strong&gt;don't need to spend a month setting up sandbox infra yourself&lt;/strong&gt; to build an agent that can get things done. But don't throw enterprise customer data into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Standard Workflow: 4 SDK Calls to Complete an Agent Interaction
&lt;/h2&gt;

&lt;p&gt;Here is the minimum viable flow, distilled from the official Colab notebook (&lt;a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/agents/managed-agents/intro_managed_agents_python.ipynb" rel="noopener noreferrer"&gt;&lt;code&gt;intro_managed_agents_python.ipynb&lt;/code&gt;&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Enterprise mode client (this flag is crucial, will explain in pitfalls)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enterprise&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Create agent (one-time, reusable)
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;antigravity-preview-05-2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multi-stage research agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a research planner. The first line is the stage label PLAN/SEARCH/WRITE …&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. First interaction, open a new sandbox
&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLAN&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;topic: Selection of SOTA open-source vector databases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ⚠️ Must be True, will explain later
&lt;/span&gt;    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Continue with the same environment
&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SEARCH_COMPARE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;previous_interaction_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Connect history
&lt;/span&gt;    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# poll for results
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;polled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;polled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;polled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No exaggeration: a &lt;strong&gt;multi-stage agent built from scratch takes only a few dozen lines of code&lt;/strong&gt;. But the devil is in &lt;code&gt;background=True&lt;/code&gt; and that polling loop, both of which are discussed in detail in the pitfalls section.&lt;/p&gt;
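
&lt;p&gt;One thing worth noting about the polling loop above: it never exits if the interaction does not reach &lt;code&gt;completed&lt;/code&gt;. A slightly more defensive version could look like the sketch below. It is written against a generic &lt;code&gt;fetch&lt;/code&gt; callable so it runs without the SDK; &lt;code&gt;poll_until_done&lt;/code&gt; is a hypothetical helper, and the failure status names (&lt;code&gt;failed&lt;/code&gt;, &lt;code&gt;cancelled&lt;/code&gt;) as well as the &lt;code&gt;in_progress&lt;/code&gt; placeholder are assumptions, since only &lt;code&gt;completed&lt;/code&gt; appears in the example above.&lt;/p&gt;

```python
import time
from types import SimpleNamespace
from typing import Any, Callable

# "completed" appears in the snippet above; the failure names are assumptions.
DONE_STATUS = "completed"
ASSUMED_FAILURE_STATUSES = frozenset({"failed", "cancelled"})


def poll_until_done(
    fetch: Callable[[], Any],
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call fetch() until its .status is terminal or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result.status == DONE_STATUS:
            return result
        if result.status in ASSUMED_FAILURE_STATUSES:
            raise RuntimeError(f"interaction ended with status {result.status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"no terminal status after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a fake status sequence instead of a real API call.
statuses = iter(["in_progress", "in_progress", "completed"])
result = poll_until_done(
    lambda: SimpleNamespace(status=next(statuses)),
    sleep=lambda _seconds: None,  # skip real waiting in the demo
)
print(result.status)
```

&lt;p&gt;With the real client, the call would presumably be &lt;code&gt;poll_until_done(lambda: client.interactions.get(r2.id))&lt;/code&gt;, reusing the &lt;code&gt;get&lt;/code&gt; call from the loop above.&lt;/p&gt;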




&lt;h2&gt;
  
  
  Demo Case: LINE Research Planner Bot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw3mbf8icvpldu0h8h18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw3mbf8icvpldu0h8h18.png" alt="image-20260602221558435" width="724" height="1456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkndkeuaw3vqrkfv34ccp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkndkeuaw3vqrkfv34ccp.png" alt="image-20260602221619051" width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SDK examples alone are too abstract, so I built it into a working LINE Bot, open-sourced at &lt;a href="https://github.com/kkdai/line-research-bot" rel="noopener noreferrer"&gt;&lt;code&gt;kkdai/line-research-bot&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The user sends a &lt;strong&gt;research topic&lt;/strong&gt; in the LINE chat box (e.g., "Research on the selection of SOTA open-source vector databases").&lt;/li&gt;
&lt;li&gt;  The Bot plans 4-8 search queries, runs google_search + url_context, compares sources, writes a report in Traditional Chinese, and publishes it as a public HTML link.&lt;/li&gt;
&lt;li&gt;  The user then sends "&lt;strong&gt;Chapter 2, go deeper, add Japanese sources&lt;/strong&gt;" and the Bot modifies the original file in the &lt;strong&gt;same sandbox&lt;/strong&gt;, re-renders it, and keeps a snapshot of the old version.&lt;/li&gt;
&lt;li&gt;  Deployment targets: GCP Cloud Run + Firestore + GCS + Cloud Tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is very straightforward:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LINE Webhook&lt;/td&gt;
&lt;td&gt;FastAPI receives message events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firestore&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;line_bot_users / line_bot_reports&lt;/code&gt; persistence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Tasks&lt;/td&gt;
&lt;td&gt;Pushes long-running tasks from webhook to background worker (avoids LINE reply token 60-second limit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed Agent&lt;/td&gt;
&lt;td&gt;Planning + Search comparison + Writing (&lt;strong&gt;three-stage&lt;/strong&gt; chain)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Run worker&lt;/td&gt;
&lt;td&gt;Renders markdown → HTML → Uploads to GCS (&lt;strong&gt;why not in the sandbox? See Pitfall 2&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS Bucket&lt;/td&gt;
&lt;td&gt;Public HTML hosting&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mapping this back to the three core capabilities described earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Persistent Sandbox&lt;/strong&gt;: The three stages PLAN → SEARCH_COMPARE → WRITE_REPORT are chained within the same &lt;code&gt;environment_id&lt;/code&gt;, so &lt;code&gt;sources.json&lt;/code&gt; is written once and can be read by all three stages.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Built-in Tools&lt;/strong&gt;: The SEARCH_COMPARE stage uses &lt;code&gt;google_search&lt;/code&gt; + &lt;code&gt;url_context&lt;/code&gt;. The agent decides what to search, which pages to read, and how to summarize.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multi-turn Session&lt;/strong&gt;: "Progressive deepening" directly uses &lt;code&gt;previous_interaction_id&lt;/code&gt; to continue from the last WRITE_REPORT, and the agent naturally understands "just modify that report".&lt;/li&gt;
&lt;/ul&gt;
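&lt;p&gt;To make the chaining concrete, here is a minimal sketch of the control flow only. It assumes a caller-supplied &lt;code&gt;run&lt;/code&gt; function that starts one interaction and returns its ID; &lt;code&gt;run_report_pipeline&lt;/code&gt;, &lt;code&gt;deepen_report&lt;/code&gt;, and the prompt format are hypothetical names for illustration, not code from the repo or the SDK.&lt;/p&gt;

```python
from typing import Callable, Optional

STAGES = ("PLAN", "SEARCH_COMPARE", "WRITE_REPORT")

# Hypothetical signature:
# (prompt, environment_id, previous_interaction_id) -> new interaction id
RunInteraction = Callable[[str, str, Optional[str]], str]


def run_report_pipeline(run: RunInteraction, topic: str, environment_id: str) -> str:
    """Run the three stages in one sandbox and return the last interaction id."""
    last_id = ""
    for stage in STAGES:
        # Same environment_id every time, so files such as sources.json persist.
        last_id = run(f"[{stage}] topic: {topic}", environment_id, None)
    return last_id


def deepen_report(run: RunInteraction, request: str, environment_id: str, last_id: str) -> str:
    """Continue from the last WRITE_REPORT via previous_interaction_id."""
    return run(request, environment_id, last_id)
```

&lt;p&gt;In a real service, &lt;code&gt;run&lt;/code&gt; would wrap the SDK call and the wait for completion; keeping it injectable also makes the flow easy to unit test with a fake.&lt;/p&gt;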

&lt;p&gt;The entire repo is about 2,500 lines of Python (including tests), which is enough for a "&lt;strong&gt;runnable, evolvable, traceable&lt;/strong&gt;" research agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment Practice: Commit → Go Live Automatically
&lt;/h2&gt;

&lt;p&gt;It's not enough for the open-source example to just run; this time, the entire GCP infrastructure and CI/CD are integrated.&lt;/p&gt;

&lt;p&gt;I only provided the project ID + LINE secret, and it handled the rest end-to-end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable 6 APIs&lt;/span&gt;
gcloud services &lt;span class="nb"&gt;enable &lt;/span&gt;aiplatform.googleapis.com run.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    cloudtasks.googleapis.com firestore.googleapis.com &lt;span class="se"&gt;\&lt;/span&gt;
    storage.googleapis.com secretmanager.googleapis.com

&lt;span class="c"&gt;# Create service account + assign 8 roles&lt;/span&gt;
gcloud iam service-accounts create line-bot-sa
&lt;span class="k"&gt;for &lt;/span&gt;role &lt;span class="k"&gt;in &lt;/span&gt;aiplatform.user datastore.user cloudtasks.enqueuer &lt;span class="se"&gt;\&lt;/span&gt;
            storage.objectAdmin secretmanager.secretAccessor &lt;span class="se"&gt;\&lt;/span&gt;
            iam.serviceAccountTokenCreator run.invoker logging.logWriter&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;gcloud projects add-iam-policy-binding line-vertex &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:line-bot-sa@line-vertex.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/&lt;/span&gt;&lt;span class="nv"&gt;$role&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Secrets via stdin, no shell history&lt;/span&gt;
&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'%s'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LINE_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | gcloud secrets create LINE_CHANNEL_ACCESS_TOKEN &lt;span class="nt"&gt;--data-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;-

&lt;span class="c"&gt;# Create Agent (one-time)&lt;/span&gt;
curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gcloud auth print-access-token&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; @agent-body.json &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"https://aiplatform.googleapis.com/v1beta1/projects/line-vertex/locations/global/agents"&lt;/span&gt;

&lt;span class="c"&gt;# Deploy Cloud Run&lt;/span&gt;
gcloud run deploy line-research-bot &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3600 &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2Gi ...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire process took about 40 minutes, but &lt;strong&gt;30 of those minutes were spent chasing the five pitfalls described below&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfall Log: Five Pre-GA-Specific Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pitfall One: Synchronous Calls → Mysterious &lt;code&gt;RESOURCE_PROJECT_INVALID&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The first time I followed the doc and directly POSTed &lt;code&gt;interactions:create&lt;/code&gt; via REST, it returned this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Invalid resource field value in the request."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INVALID_ARGUMENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RESOURCE_PROJECT_INVALID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aiplatform.googleapis.com"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spent a full hour and a half wondering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Project not allowlisted? (Couldn't find where to apply)&lt;/li&gt;
&lt;li&gt;  Use project number or ID? (Tried both, both wrong)&lt;/li&gt;
&lt;li&gt;  Change region? (All wrong)&lt;/li&gt;
&lt;li&gt;  Change agent? (All wrong)&lt;/li&gt;
&lt;li&gt;  Even &lt;code&gt;gemini-2.0-flash:generateContent&lt;/code&gt; returned &lt;code&gt;RESOURCE_PROJECT_INVALID&lt;/code&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until I carefully read the official colab and saw a line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enterprise&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only difference from the &lt;code&gt;genai.Client()&lt;/code&gt; I had been using was &lt;code&gt;enterprise=True&lt;/code&gt;. Then I ran the colab code and saw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;background=True&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

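&lt;p&gt;I took this back to my own code: once I switched from raw REST to the SDK with &lt;code&gt;background=True&lt;/code&gt;, it worked immediately. For comparison, this is what came back when &lt;code&gt;background&lt;/code&gt; was left out:&lt;br&gt;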
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chiliagon path must set background to true."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;background&lt;/code&gt; was not included → 500 with a &lt;code&gt;Chiliagon&lt;/code&gt; message (apparently an internal Google codename; it is not in the doc). If &lt;code&gt;enterprise=True&lt;/code&gt; was not included → the request was routed to an older path that does not serve the Pre-GA API → &lt;code&gt;RESOURCE_PROJECT_INVALID&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Pre-GA Managed Agents API currently &lt;strong&gt;only supports asynchronous calls&lt;/strong&gt;. Actual usage requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Using the &lt;code&gt;google-genai&lt;/code&gt; SDK with &lt;code&gt;enterprise=True&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;code&gt;interactions.create(background=True, store=True)&lt;/code&gt; to get an interaction ID&lt;/li&gt;
&lt;li&gt; &lt;code&gt;interactions.get(id)&lt;/code&gt; polling until &lt;code&gt;status == "completed"&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
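&lt;p&gt;Step 3 can be sketched as a small polling helper. This is a minimal sketch under assumptions: the helper takes any zero-argument function that returns the current status string, so it does not depend on the SDK's exact shape; the failure state names in &lt;code&gt;FAILED_STATES&lt;/code&gt; are my guess (only &lt;code&gt;"completed"&lt;/code&gt; is confirmed above), and &lt;code&gt;wait_for_completion&lt;/code&gt; is a name made up for illustration.&lt;/p&gt;

```python
import time
from typing import Callable

# Assumption: only "completed" is confirmed; these failure names are guesses.
FAILED_STATES = {"failed", "cancelled"}


def wait_for_completion(
    get_status: Callable[[], str],
    *,
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until "completed", a failed state, or a timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "completed":
            return status
        if status in FAILED_STATES:
            raise RuntimeError(f"interaction ended with status {status!r}")
        # Check the deadline after polling so one final poll always happens.
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

&lt;p&gt;In the bot this would be called with something like &lt;code&gt;lambda: client.interactions.get(interaction_id).status&lt;/code&gt;, assuming the returned object exposes &lt;code&gt;status&lt;/code&gt; as described in step 3. Injecting &lt;code&gt;sleep&lt;/code&gt; and &lt;code&gt;clock&lt;/code&gt; keeps the loop testable without real waiting.&lt;/p&gt;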

&lt;p&gt;Don't burn that much time stubbornly trying raw REST like I did.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall Two: &lt;code&gt;gsutil&lt;/code&gt; in the Sandbox is a &lt;strong&gt;Mock&lt;/strong&gt; (This one is the most insidious)
&lt;/h3&gt;

&lt;p&gt;My LINE Bot was originally designed for the agent to upload HTML to GCS itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gsutil &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"Cache-Control:no-cache, max-age=0"&lt;/span&gt; &lt;span class="nb"&gt;cp&lt;/span&gt; /workspace/report.html &lt;span class="se"&gt;\&lt;/span&gt;
    gs://research-line/&lt;span class="o"&gt;{&lt;/span&gt;report_id&lt;span class="o"&gt;}&lt;/span&gt;/index.html
curl &lt;span class="nt"&gt;-sI&lt;/span&gt; https://storage.googleapis.com/research-line/&lt;span class="o"&gt;{&lt;/span&gt;report_id&lt;span class="o"&gt;}&lt;/span&gt;/index.html

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent finished happily and returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"report_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"d4302f31..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary_500"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"This report focuses on mainstream open-source vector databases in 2026…"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"top_citations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"new_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Flex card arrived in LINE, I tapped the button → &lt;strong&gt;404 NoSuchKey&lt;/strong&gt;. The GCS bucket was empty.&lt;/p&gt;

&lt;p&gt;I ran a diagnostic interaction to query the sandbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run these and report verbatim:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1. echo &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &amp;gt; /tmp/diag.html&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2. gcloud auth list 2&amp;gt;&amp;amp;1&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3. gsutil cp /tmp/diag.html gs://research-line/probe.html 2&amp;gt;&amp;amp;1&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4. curl -sI https://storage.googleapis.com/research-line/probe.html&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5. gsutil ls gs://research-line/ 2&amp;gt;&amp;amp;1&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply ONLY with: {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;step1&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;, ...}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ENV_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The returned JSON made me jump out of my chair:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"step2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No credentialed accounts.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;To login, run:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt; $ gcloud auth login..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"step3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mock gsutil: simulated copy to cp /tmp/diag.html gs://research-line/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"step4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HTTP/2 200 OK&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"step5"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mock gsutil: simulated copy to ls gs://research-line/..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The sandbox ships a fake command that identifies itself as "Mock gsutil"&lt;/strong&gt;: it prints "simulated copy" for any arguments (even &lt;code&gt;ls&lt;/code&gt;), and the follow-up curl reported HTTP 200 as well. &lt;code&gt;gcloud auth list&lt;/code&gt; showed &lt;strong&gt;no credentials&lt;/strong&gt;, so even a real gsutil would not have had permission to write.&lt;/p&gt;

&lt;p&gt;At that moment, I finally understood: the Pre-GA sandbox &lt;strong&gt;does not provide any GCP authentication&lt;/strong&gt;. &lt;code&gt;gsutil&lt;/code&gt; is only a placeholder, and the agent had no way to know the upload failed (curl also returned 200), so it happily reported success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Completely refactor the architecture. The agent no longer attempts to upload; instead, the &lt;strong&gt;agent returns the complete markdown via the &lt;code&gt;report_md&lt;/code&gt; field&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# New system_instruction (excerpt)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
After writing /workspace/report.md, use code_execution to read it back
and return JSON:
{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report_md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;full contents of /workspace/report.md&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary_500&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
&lt;/span&gt;&lt;span class="gp"&gt;  ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;DO&lt;/span&gt; &lt;span class="n"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;gsutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;DO&lt;/span&gt; &lt;span class="n"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;curl&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;googleapis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt; &lt;span class="n"&gt;handles&lt;/span&gt; &lt;span class="n"&gt;publishing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the Cloud Run worker, using a service account with real IAM, takes over:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/publisher.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;markdown&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GcsPublisher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report_md&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_previous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;snapshot_previous&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_previous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_md&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fenced_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;footnotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_wrap_with_css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_control&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-cache, max-age=0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/html; charset=utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://storage.googleapis.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clear division of responsibilities: &lt;strong&gt;the agent is responsible for thinking + writing; Cloud Run is responsible for infra&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Do not assume the Pre-GA sandbox can access your GCP resources. For anything that needs to write to external systems, &lt;strong&gt;let the host service do it with a real SA&lt;/strong&gt;, and the agent only returns the payload. By the way, from the forum, it seems that after GA, the sandbox might provide ambient credentials, but not in Pre-GA.&lt;/p&gt;
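&lt;p&gt;Since the agent's own success report cannot be trusted here, the host side can at least validate the payload before publishing. A minimal sketch, assuming the agent's final text contains one JSON object with the &lt;code&gt;report_md&lt;/code&gt; and &lt;code&gt;summary_500&lt;/code&gt; fields from the system instruction; &lt;code&gt;parse_agent_payload&lt;/code&gt; and &lt;code&gt;REQUIRED_KEYS&lt;/code&gt; are names made up for illustration, not code from the repo.&lt;/p&gt;

```python
import json

# Fields the host needs before it publishes anything (taken from the
# system instruction excerpt; the constant name itself is hypothetical).
REQUIRED_KEYS = ("report_md", "summary_500")


def parse_agent_payload(raw: str) -> dict:
    """Extract the JSON object from the agent reply and check required fields."""
    # Tolerate stray text around the object, e.g. a short preface.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in agent reply")
    payload = json.loads(raw[start : end + 1])
    # Treat missing and empty values the same way: nothing to publish.
    missing = [key for key in REQUIRED_KEYS if not payload.get(key)]
    if missing:
        raise ValueError(f"agent reply is missing fields: {missing}")
    return payload
```

&lt;p&gt;Failing loudly here means a broken run surfaces in the worker logs instead of as a dead link in the Flex card.&lt;/p&gt;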

&lt;h3&gt;
  
  
  Pitfall Three: Cloud Run's &lt;code&gt;/healthz&lt;/code&gt; is Intercepted by Google Frontend
&lt;/h3&gt;

&lt;p&gt;I wrote a &lt;code&gt;/healthz&lt;/code&gt; for Cloud Run health checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/healthz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;healthz&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deployment, I called:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://line-research-bot-xxx.run.app/healthz

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It returned &lt;strong&gt;this&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Error 404 (Not Found)!!1&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&amp;lt;b&amp;gt;&lt;/span&gt;404.&lt;span class="nt"&gt;&amp;lt;/b&amp;gt;&lt;/span&gt; The requested URL /healthz was not found on this server.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It was &lt;strong&gt;Google Frontend's 404 page&lt;/strong&gt;, not FastAPI's. But &lt;code&gt;/docs&lt;/code&gt;, &lt;code&gt;/webhook&lt;/code&gt;, &lt;code&gt;/openapi.json&lt;/code&gt; all worked. OpenAPI also listed the &lt;code&gt;GET /healthz&lt;/code&gt; route.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/healthz&lt;/code&gt; is a &lt;strong&gt;reserved path&lt;/strong&gt; on Cloud Run: the known-issues page lists some paths ending in &lt;code&gt;z&lt;/code&gt; as reserved, and Google Frontend answers the request itself before it ever reaches the container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Rename it to &lt;code&gt;/readyz&lt;/code&gt;. Solved in one second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/readyz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# /healthz was intercepted, renamed
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;readyz&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
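
&lt;p&gt;A quick way to tell which layer answered is to look at the 404 itself. FastAPI's default 404 is a JSON body with &lt;code&gt;detail&lt;/code&gt; set to &lt;code&gt;Not Found&lt;/code&gt;, while the page above is HTML. Here is a minimal sketch of that check; &lt;code&gt;classify_404&lt;/code&gt; and its labels are hypothetical names of mine, and it assumes the app has not customized its error responses.&lt;/p&gt;

```python
import json


def classify_404(content_type, body):
    """Guess which layer produced a 404 from its content type and body."""
    if "application/json" in content_type:
        try:
            payload = json.loads(body)
        except ValueError:
            return "unknown"
        # FastAPI's default 404 body is {"detail": "Not Found"}
        if isinstance(payload, dict) and payload.get("detail") == "Not Found":
            return "fastapi"
        return "unknown"
    if "text/html" in content_type:
        # An HTML 404 did not come from an uncustomized FastAPI app
        return "frontend-or-proxy"
    return "unknown"


print(classify_404("application/json", '{"detail": "Not Found"}'))
print(classify_404("text/html; charset=UTF-8", "Error 404 (Not Found)!!1"))
```

&lt;p&gt;If the answer is not the app, no amount of debugging the route handler will help; the request never got that far.&lt;/p&gt;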



&lt;h3&gt;
  
  
  Pitfall Four: Service Account Needs to &lt;code&gt;actAs&lt;/code&gt; &lt;strong&gt;Itself&lt;/strong&gt; for Cloud Tasks OIDC to Sign
&lt;/h3&gt;

&lt;p&gt;When the webhook pushed tasks to Cloud Tasks, the tasks kept showing 0 dispatches along with an expired &lt;code&gt;dispatchDeadline&lt;/code&gt;. Cloud Run logs showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERMISSION_DENIED: The principal lacks IAM permission "iam.serviceAccounts.actAs"
for the resource "line-bot-sa@line-vertex.iam.gserviceaccount.com"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I assumed that granting the SA &lt;code&gt;iam.serviceAccountTokenCreator&lt;/code&gt; would be enough. &lt;strong&gt;It wasn't.&lt;/strong&gt; Cloud Tasks needs to sign an OIDC token for the callback, which requires the SA to have &lt;code&gt;actAs&lt;/code&gt; permission on &lt;strong&gt;itself&lt;/strong&gt;:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
shell
gcloud iam service-accounts add-iam-policy-binding \
    line
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>agents</category>
      <category>api</category>
      <category>gemini</category>
      <category>google</category>
    </item>
    <item>
      <title>Gubernator [the kill ku8s]</title>
      <dc:creator>Mario Ezquerro</dc:creator>
      <pubDate>Tue, 02 Jun 2026 12:55:56 +0000</pubDate>
      <link>https://dev.to/gde/gubernator-the-kill-ku8s-2g0e</link>
      <guid>https://dev.to/gde/gubernator-the-kill-ku8s-2g0e</guid>
      <description>&lt;p&gt;Why I built an alternative to Kubernetes overkill.&lt;/p&gt;

&lt;p&gt;Let’s talk about infrastructure efficiency.&lt;/p&gt;

&lt;p&gt;Kubernetes is the undisputed industry standard for container orchestration, and for massive, complex enterprises, it’s irreplaceable. But for small-to-medium deployments or distributed multi-host edge environments, it often feels like an operational nightmare.&lt;/p&gt;

&lt;p&gt;You install a massive orchestration layer, only to realize it's still missing the basics for actual operations. To get it production-ready, you have to layer on external tools for Ingress routing, plus a heavy stack for observability, health metrics, and SLOs. Suddenly, your infrastructure consumes more resources than your actual applications.&lt;/p&gt;

&lt;p&gt;I believe in radical software minimalism. That’s why I’ve been developing Gubernator (gbnt), an open-source, lightweight distributed container orchestrator written entirely in Go.&lt;/p&gt;

&lt;p&gt;Unlike traditional platforms, Gubernator bakes essential Site Reliability Engineering (SRE) and traffic management right into its core architecture:&lt;/p&gt;

&lt;p&gt;Native Reverse Proxy: No complex third-party Ingress controllers. Gubernator automatically manages routing—like dynamically hooking up web containers to an Ingress layer (e.g., Caddy)—making exposure seamless and native.&lt;/p&gt;

&lt;p&gt;Built-in SRE &amp;amp; Observability: Out-of-the-box support for health monitoring, metrics, and SLO tracking natively leveraging OpenTelemetry and Prometheus.&lt;/p&gt;

&lt;p&gt;Zero-Bloat State Management: It swaps out heavy external key-value stores for an embedded, rock-solid SQLite architecture, ensuring multi-host consistency with a near-zero footprint.&lt;/p&gt;

&lt;p&gt;Gubernator is designed for engineers who want robust, predictable, and resilient orchestration without the overhead and cognitive load of K8s.&lt;/p&gt;

&lt;p&gt;If you are passionate about minimalist backend architecture, systems engineering, or streamlined DevOps, I’d love for you to take a look at the blueprint and documentation:&lt;/p&gt;

&lt;p&gt;Explore the project: &lt;a href="https://mario-ezquerro.github.io/gubernator/" rel="noopener noreferrer"&gt;https://mario-ezquerro.github.io/gubernator/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do you handle container orchestration for smaller setups? Let’s connect and discuss in the comments!&lt;/p&gt;

&lt;p&gt;#SoftwareEngineering #GoLang #SRE #DevOps #ContainerOrchestration #OpenSource #SystemsArchitecture #Minimalism&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>infrastructure</category>
      <category>kubernetes</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Built an Autonomous AI Agent with Google ADK + Gemini That Spots Trends and Drafts Dev.to Articles for Me</title>
      <dc:creator>Aryan Irani</dc:creator>
      <pubDate>Tue, 02 Jun 2026 09:43:33 +0000</pubDate>
      <link>https://dev.to/gde/i-built-an-autonomous-ai-agent-with-google-adk-gemini-20-flash-that-spots-trends-and-drafts-60</link>
      <guid>https://dev.to/gde/i-built-an-autonomous-ai-agent-with-google-adk-gemini-20-flash-that-spots-trends-and-drafts-60</guid>
      <description>&lt;p&gt;Keeping up with trending technical topics and new tools on developer forums can be time-consuming. To save time, I wanted to automate the process of finding popular articles, reading the comments to understand community sentiment, and drafting a summary.&lt;/p&gt;

&lt;p&gt;While I could write a standard Python script to scrape the dev.to API, simple scripts tend to be brittle. If an article doesn't have comments yet, a basic script will likely crash unless you write extensive error-handling logic.&lt;/p&gt;

&lt;p&gt;Instead of a rigid script, I built an &lt;strong&gt;Agent&lt;/strong&gt;—a program that can dynamically reason about errors and adjust its approach. If one task fails, it can figure out the next best step.&lt;/p&gt;

&lt;p&gt;In this tutorial, I'll show you how to build a Trend-Spotting Agent using Python, the &lt;strong&gt;Google Agent Development Kit (ADK)&lt;/strong&gt;, and Gemini 2.5 Flash.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are going to write a Python application that acts as an autonomous agent. We'll give it three abilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search the dev.to API for rising technical articles based on specific tags.&lt;/li&gt;
&lt;li&gt;Dynamically fetch the top comments of those articles to read real community sentiment.&lt;/li&gt;
&lt;li&gt;Automatically draft a newsletter-style article on your DEV.to account summarizing its findings.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.9+&lt;/strong&gt; installed on your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google ADK&lt;/strong&gt;. (Check out the &lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Google ADK Docs&lt;/a&gt; if you need help installing).&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;DEV API Key&lt;/strong&gt;. Grab this from your DEV.to account settings under "Extensions" and throw it in a &lt;code&gt;.env&lt;/code&gt; file.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Giving the Agent its "Hands" (API Tools)
&lt;/h2&gt;

&lt;p&gt;Large Language Models (LLMs) are incredibly smart, but out of the box, they can't actually &lt;em&gt;do&lt;/em&gt; anything on your computer. &lt;/p&gt;

&lt;p&gt;The coolest part about Google ADK is that we can write standard Python functions, hand them to the LLM as "tools", and let the AI decide how and when to use them.&lt;/p&gt;

&lt;p&gt;Let's write our API functions. &lt;/p&gt;

&lt;h3&gt;
  
  
  Tool 1: Finding Rising Articles
&lt;/h3&gt;

&lt;p&gt;Here is our function to fetch rising articles. Pay close attention to the docstring (&lt;code&gt;"""Fetches the top..."""&lt;/code&gt;). We aren't writing this for other developers; the ADK actually passes this docstring directly to the LLM so it understands exactly what the tool does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;DEV_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEV_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_rising_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetches the top rising articles for a specific tag on DEV.to.
    Returns a formatted string of the articles with their ID, title, and URL.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/articles?tag=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;state=rising&amp;amp;per_page=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rising articles for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, URL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching articles: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
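
&lt;p&gt;To make the docstring point concrete, here is a rough sketch of the raw material a framework can read off a plain Python function: its name, docstring, and type hints. It uses only the standard library's &lt;code&gt;inspect&lt;/code&gt; module. It is not ADK's actual schema generation, and &lt;code&gt;describe_tool&lt;/code&gt; and &lt;code&gt;get_rising_articles_stub&lt;/code&gt; are hypothetical names used just for illustration.&lt;/p&gt;

```python
import inspect


def describe_tool(func):
    """Collect the metadata a framework could show to an LLM for this function."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
    }


# Hypothetical stand-in with the same signature as the tool above (no network calls).
def get_rising_articles_stub(tag: str, limit: int = 5):
    """Fetches the top rising articles for a specific tag on DEV.to."""
    return ""


tool_info = describe_tool(get_rising_articles_stub)
print(tool_info["name"])
print(tool_info["description"])
print(tool_info["parameters"])
```

&lt;p&gt;The takeaway: a vague docstring or missing type hints leaves the model guessing, so it pays to write both carefully.&lt;/p&gt;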



&lt;h3&gt;
  
  
  Tool 2: Fetching Community Comments
&lt;/h3&gt;

&lt;p&gt;We don't just want our agent to read article titles; we want it to know what the community thinks. This function fetches the comments for a given article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_comments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetches the top comments for a specific DEV.to article by ID to gauge community sentiment.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/comments?a_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
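    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Report HTTP errors to the agent instead of parsing an error payload as comments
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching comments: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;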

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No comments found on this article.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;summary_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="c1"&gt;# Grab up to the top 5 comments
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Anonymous&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body_html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;summary_blocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Comment &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; by &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top Comments:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary_blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tool 3: The Ghostwriter
&lt;/h3&gt;

&lt;p&gt;Finally, we need a way for the agent to report back. This function takes the agent's research and POSTs it directly to your Dev.to dashboard as an unpublished draft.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
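&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_article_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body_markdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Creates a new, unpublished article Draft on the user's DEV.to account.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;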
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/articles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DEV_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Keeps it as a safe draft!
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body_markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;body_markdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully created draft! URL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to create draft: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2: Wiring up the Agent
&lt;/h2&gt;

&lt;p&gt;With our tools ready, we just need to initialize the Google ADK Agent. We give it a persona, drop in our functions, and let it go to work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;agent_instruction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a Trend-Spotting Research Agent.
Every morning, or when requested, you monitor specific tags on DEV (Forem) to find the most interesting emerging technologies or topics.

Your workflow:
1. Use the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;get_rising_articles&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; tool to fetch the top rising articles.
2. Use the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fetch_comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; tool on the most popular articles to evaluate the community&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s sentiment.
3. Synthesize the articles and the sentiment into an insightful trend report.
4. Use the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_article_draft&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; tool to automatically generate an unpublished DEV.to article containing your full synthesized report.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trend_spotting_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spots rising trends on DEV community and automatically drafts digest articles.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Here is where we hand over our Python functions!
&lt;/span&gt;    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_rising_articles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetch_comments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_article_draft&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# Optional: configure automatic HTTP retries just in case the API drops
&lt;/span&gt;    &lt;span class="n"&gt;generate_content_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;retry_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpRetryOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that is literally all the setup we need. &lt;/p&gt;




&lt;h2&gt;
  
  
  The "Aha!" Moment
&lt;/h2&gt;

&lt;p&gt;To test this out, I launched the ADK web interface from my terminal (&lt;code&gt;adk web&lt;/code&gt;) and gave the agent a fairly complex prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I need a complete trend analysis for the 'machinelearning' tag. Fetch the top 5 rising articles. Read the comments on the top article to gauge what the community actually thinks. Synthesize this research and draft a new DEV.to article for me titled 'The Community Pulse'."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I watched the terminal logs, and this is where the power of an &lt;strong&gt;Agentic workflow&lt;/strong&gt; really clicked for me. &lt;/p&gt;

&lt;p&gt;The agent easily retrieved the list of articles. It then grabbed the ID of the #1 article and called our &lt;code&gt;fetch_comments(id)&lt;/code&gt; tool. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But there was a problem.&lt;/strong&gt; Because it was a rapidly &lt;em&gt;rising&lt;/em&gt; article, nobody had actually commented on it yet. &lt;/p&gt;

&lt;p&gt;If this were a traditional scripted loop, the program would have just returned an empty array, moved on to the writing phase, and published a broken, empty newsletter. &lt;/p&gt;

&lt;p&gt;But because this is an &lt;em&gt;Agent&lt;/em&gt;, it analyzed the tool's response ("No comments found on this article.") and realized it couldn't complete my request. &lt;strong&gt;Without any prompting from me&lt;/strong&gt;, the logs showed the agent dynamically pivoting:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Okay, no comments on that one. Let's try the next most interesting one, 'Chapter 1: The Value Class'."&lt;/em&gt;&lt;br&gt;
&lt;em&gt;(Calls tool)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;"Okay, no comments there either. Let's try 'Building a GPT From Scratch'."&lt;/em&gt;&lt;br&gt;
&lt;em&gt;(Calls tool)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;"Still no comments. Last try, let's check 'Prove You're 18 Without Showing Who You Are'."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgefabwx9nm7q8uehqm4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgefabwx9nm7q8uehqm4t.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It iterated through the list, dynamically feeding the output of our tool back into its own reasoning loop until it successfully found a lively debate! It then analyzed the sentiment and proceeded to the final step.&lt;/p&gt;
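
&lt;p&gt;For comparison, here is a minimal sketch of what reproducing that fallback by hand might involve. The names &lt;code&gt;format_comments&lt;/code&gt; and &lt;code&gt;first_article_with_comments&lt;/code&gt;, and the shape of a comment record, are assumptions made for illustration and are not taken from the project's source. The first function shows one way a tool could return an explicit message instead of an empty list, which gives the model something to react to.&lt;/p&gt;

```python
def format_comments(comments):
    """Turn a raw comment list into text a model can reason about."""
    if not comments:
        # An explicit message, rather than an empty list, tells the model
        # that the call worked but there is nothing to analyze.
        return "No comments found on this article."
    return "\n".join(f"- {c['author']}: {c['body']}" for c in comments)


def first_article_with_comments(article_ids, fetch):
    """Hand-written fallback: walk the list until an article has comments."""
    for article_id in article_ids:
        comments = fetch(article_id)
        if comments:
            return article_id, comments
    # Nothing had comments; the caller must decide what to do next.
    return None, []


# Hypothetical data: only the third article has any discussion.
fake_store = {3: [{"author": "dev1", "body": "Great write-up"}]}
found_id, found = first_article_with_comments([1, 2, 3], lambda i: fake_store.get(i, []))
print(found_id)
print(format_comments(found))
```

&lt;p&gt;With an agent, the second function is the part you do not have to write: the model reads the "No comments found" message and chooses the next article itself.&lt;/p&gt;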

&lt;p&gt;With the research complete, the agent executed its final tool: &lt;code&gt;create_article_draft()&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;When I opened my DEV.to dashboard, I had a fully formatted Markdown article sitting as an unpublished draft, ready for me to review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4ur4p97w5dmwubsudly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4ur4p97w5dmwubsudly.png" alt=" " width="800" height="938"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Connecting ordinary REST APIs to the Google ADK framework changes how you think about automation. You no longer need to hand-write a branch or a &lt;code&gt;try/except&lt;/code&gt; block for every edge case in the workflow itself. &lt;/p&gt;

&lt;p&gt;You just build reliable, single-purpose tools, hand them over, and let the LLM figure out how to navigate the road bumps.&lt;/p&gt;

&lt;p&gt;If you want to see the full source code for this project, you can check it out on my GitHub here: &lt;a href="https://github.com/aryanirani123/Agent-Development-Kit/tree/main/trend-spotting-agent" rel="noopener noreferrer"&gt;Link to Repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let me know in the comments if you've built any interesting agents lately, and I'll see you in the next one!&lt;/p&gt;

</description>
      <category>agentdevelopmentkit</category>
      <category>googlecloud</category>
      <category>devto</category>
      <category>gemini</category>
    </item>
    <item>
      <title>31B — Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Mon, 01 Jun 2026 22:12:15 +0000</pubDate>
      <link>https://dev.to/gde/31b-gemma-4-deployment-with-nvidia-l4-mcp-cloud-run-and-antigravity-cli-2334</link>
      <guid>https://dev.to/gde/31b-gemma-4-deployment-with-nvidia-l4-mcp-cloud-run-and-antigravity-cli-2334</guid>
      <description>&lt;p&gt;This article provides a step-by-step deployment review for Gemma 4 on a GPU-enabled system hosted on Google Cloud Run. A suite of Python MCP tools is built to simplify management of the vLLM-hosted Gemma 4 deployment with Antigravity CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is this project trying to do?
&lt;/h4&gt;

&lt;p&gt;This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.&lt;/p&gt;

&lt;p&gt;This project is similar to a previous project that targeted GPU-hosted Gemma 4 instances on GCP:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/gemma-sre-self-hosted-vllm-infrastructure-agent-afda000c0943" rel="noopener noreferrer"&gt;Gemma-SRE: Self-Hosted vLLM Infrastructure Agent&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;Antigravity CLI is the successor to Gemini CLI, the terminal-driven, agent-assisted coding tool.&lt;/p&gt;

&lt;p&gt;Full details on installing Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing the Antigravity CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools in place, you can test the startup of Antigravity CLI.&lt;/p&gt;

&lt;p&gt;You will need to authenticate with a Google Cloud Project or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python MCP Documentation
&lt;/h4&gt;

&lt;p&gt;The official GitHub Repo provides samples and documentation for getting started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting MCP development for model management is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Antigravity CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.&lt;/p&gt;

&lt;p&gt;This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemma4-tips
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/gpu-31B-L4-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/gpu-31B-L4-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the set_env.sh script can be used to reset the environment if your session times out.&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Management Tool with MCP Stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features that the standard MCP libraries provide is abstracting various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same no matter which low-level transport channel the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;The simplest transport that the SDK supports is the stdio (stdin/stdout) transport, which connects to a locally running process. Both the MCP client and the MCP server must be running in the same environment.&lt;/p&gt;

&lt;p&gt;The connection over stdio will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize FastMCP server
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Self-Hosted vLLM DevOps Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
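
&lt;p&gt;To make the stdio transport less abstract, here is a toy sketch of the wire format: JSON-RPC 2.0 messages, one per line, read from standard input and answered on standard output. This is not how the SDK is implemented and it is not the project's code; &lt;code&gt;serve_lines&lt;/code&gt; is a hypothetical name, and the loop only answers the &lt;code&gt;ping&lt;/code&gt; method. In-memory streams stand in for the real pipes so the example runs on its own.&lt;/p&gt;

```python
import io
import json


def serve_lines(stdin, stdout):
    """Read newline-delimited JSON-RPC requests and answer each one."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        if request.get("method") == "ping":
            # A ping is answered with an empty result object.
            response = {"jsonrpc": "2.0", "id": request["id"], "result": {}}
        else:
            # -32601 is the standard JSON-RPC "method not found" code.
            response = {
                "jsonrpc": "2.0",
                "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"},
            }
        # One JSON object per line; json.dumps does not emit raw newlines.
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()


# Simulate a client by pushing one request through in-memory streams.
fake_in = io.StringIO(json.dumps({"jsonrpc": "2.0", "id": 1, "method": "ping"}) + "\n")
fake_out = io.StringIO()
serve_lines(fake_in, fake_out)
reply = json.loads(fake_out.getvalue())
print(reply)
```

&lt;p&gt;In a real server the SDK owns this loop, so anything the server prints to standard output outside the protocol can corrupt the stream. Logging normally goes to standard error instead.&lt;/p&gt;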



&lt;h4&gt;
  
  
  Running the Python Code
&lt;/h4&gt;

&lt;p&gt;First, switch to the directory with the Python version of the MCP sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ~/gemma4-tips/gpu-31B-L4-devops-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the release version on the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;install
&lt;/span&gt;Processing ./.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project can also be linted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@cloudshell:~/gemma4-tips/gpu-31B-L4-devops-agent &lt;span class="o"&gt;(&lt;/span&gt;aisprint-491218&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;make lint
ruff check &lt;span class="nb"&gt;.&lt;/span&gt;
All checks passed!
ruff format &lt;span class="nt"&gt;--check&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
6 files already formatted
mypy &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a test run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@cloudshell:~/gemma4-tips/gpu-31B-L4-devops-agent &lt;span class="o"&gt;(&lt;/span&gt;aisprint-491218&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;make &lt;span class="nb"&gt;test
&lt;/span&gt;python test_agent.py
2026-06-01 00:58:44,717 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
..2026-06-01 00:58:45,306 - asyncio - WARNING - Executing &amp;lt;Task pending &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Task-11'&lt;/span&gt; &lt;span class="nv"&gt;coro&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;TestDevOpsAgent.test_deploy_vllm_hf&lt;span class="o"&gt;()&lt;/span&gt; running at /usr/lib/python3.12/unittest/mock.py:1407&amp;gt; &lt;span class="nv"&gt;wait_for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;Future pending &lt;span class="nv"&gt;cb&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;_chain_future.&amp;lt;locals&amp;gt;._call_check_cancel&lt;span class="o"&gt;()&lt;/span&gt; at /usr/lib/python3.12/asyncio/futures.py:387, Task.task_wakeup&lt;span class="o"&gt;()]&lt;/span&gt; created at /usr/lib/python3.12/asyncio/base_events.py:449&amp;gt; &lt;span class="nv"&gt;cb&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;_run_until_complete_cb&lt;span class="o"&gt;()&lt;/span&gt; at /usr/lib/python3.12/asyncio/base_events.py:182] created at /usr/lib/python3.12/asyncio/runners.py:100&amp;gt; took 0.521 seconds
.......2026-06-01 00:58:45,325 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-06-01 00:58:45,325 - vllm-devops-agent - INFO - Model response: &lt;span class="s1"&gt;'Response from Gemma...'&lt;/span&gt;
.2026-06-01 00:58:45,329 - vllm-devops-agent - INFO - Querying model with stats with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-06-01 00:58:45,329 - vllm-devops-agent - INFO - Model response with stats: &lt;span class="nv"&gt;TTFT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s, &lt;span class="nv"&gt;TotalTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s
.......
&lt;span class="nt"&gt;----------------------------------------------------------------------&lt;/span&gt;
Ran 17 tests &lt;span class="k"&gt;in &lt;/span&gt;0.578s

OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  MCP stdio Transport
&lt;/h4&gt;

&lt;p&gt;In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI mcp_config.json
&lt;/h4&gt;

&lt;p&gt;A sample MCP server configuration file (mcp_config.json) is provided in the .agents directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gpu-devops-agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/home/xbill/gemma4-tips/gpu-31B-L4-devops-agent/server.py"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aisprint-491218"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"VLLM_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MODEL_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/mnt/models/gemma-4-31B-it"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
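
&lt;p&gt;As a rough illustration of what a client does with an entry like this, the sketch below turns one &lt;code&gt;mcpServers&lt;/code&gt; entry into the command line and environment used to start the server process. The function &lt;code&gt;launch_spec&lt;/code&gt; and the path &lt;code&gt;/path/to/server.py&lt;/code&gt; are hypothetical, and layering the &lt;code&gt;env&lt;/code&gt; values on top of the inherited environment is an assumption, not documented Antigravity CLI behavior.&lt;/p&gt;

```python
import json
import os

# A trimmed-down config in the same shape as the sample above.
CONFIG_TEXT = """
{
  "mcpServers": {
    "gpu-devops-agent": {
      "command": "python3",
      "args": ["/path/to/server.py"],
      "env": {"GOOGLE_CLOUD_LOCATION": "us-east4"}
    }
  }
}
"""


def launch_spec(config_text, server_name, base_env=None):
    """Return (argv, env) for starting one configured stdio server."""
    entry = json.loads(config_text)["mcpServers"][server_name]
    argv = [entry["command"], *entry.get("args", [])]
    # Assumption: configured values override the inherited environment.
    env = dict(os.environ if base_env is None else base_env)
    env.update(entry.get("env", {}))
    return argv, env


argv, env = launch_spec(CONFIG_TEXT, "gpu-devops-agent", base_env={"PATH": "/usr/bin"})
print(argv)
print(env)
```

&lt;p&gt;The resulting pair is what you would hand to a process launcher such as &lt;code&gt;subprocess.Popen&lt;/code&gt;, with the child's stdin and stdout connected as pipes for the stdio transport.&lt;/p&gt;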



&lt;h4&gt;
  
  
  Validation with Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;The final connection test uses Antigravity CLI as an MCP client with the Python code providing the MCP server.&lt;/p&gt;

&lt;h4&gt;
  
  
  Getting Started with Gemma 4 on GPU
&lt;/h4&gt;

&lt;p&gt;The official vLLM repo also has Gemma 4 specific information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm/releases" rel="noopener noreferrer"&gt;Releases · vllm-project/vllm&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Lifecycle Management via MCP
&lt;/h4&gt;

&lt;p&gt;The MCP tools provide a complete suite of agent-oriented operations for managing the vLLM deployment on Cloud Run with GPU.&lt;/p&gt;

&lt;p&gt;Overview of MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
&amp;gt; ✓ gpu-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                       analyze_cloud_logging, +21 more

● gpu-devops-agent/get_help(Call get_help tool) (ctrl+o to expand)

  The active configuration and available DevOps/SRE tools are outlined below.

  ### ⚙️ Active Environment Configuration

  • GCP Project: aisprint-491218 
  • Region: us-east4 
  • Model: /mnt/models/gemma-4-31B-it (quantized at 4-bit, serving on NVIDIA L4)
  • Active Service Endpoint: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app 

  ### 🧰 Summary of Available Toolsets

  1. Deployment &amp;amp; Scaling: deploy_vllm , destroy_vllm , status_vllm , update_vllm_scaling ,       
  get_vllm_deployment_config , check_gpu_quotas 
  2. Diagnostics &amp;amp; SRE: analyze_gpu_logs , suggest_sre_remediation , verify_model_health ,         
  get_system_status 
  3. Benchmarking: run_benchmark 
  4. Model Interrogation: query_gemma4 , query_gemma4_with_stats , get_model_details 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploy The Model
&lt;/h4&gt;

&lt;p&gt;The MCP server tools are used to deploy the model to Cloud Run with GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deploy the model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checking System Status
&lt;/h4&gt;

&lt;p&gt;The status can be checked with an MCP tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;● gpu-devops-agent/get_system_status(Call system status tool) (ctrl+o to expand)

  I have successfully retrieved the system status using the get_system_status tool.

  ### 🌀 GPU Cloud Run System Status

  • vLLM Health: 🟢 Online (https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app)
  • Cloud Run Service Status: 🟢 Ready

  Next Step: You can use the query_gemma4 tool to interact with the model.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cross Check The Deployed Model
&lt;/h4&gt;

&lt;p&gt;Once the model starts, the MCP tools allow for model verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; verify_model_health

● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/verify_model_health.json)
● gpu-devops-agent/verify_model_health(Verify model health) (ctrl+o to expand)

  I have verified the model's health using the verify_model_health tool.

  ### 🏥 Model Health Check Results

  • Status: ✅ PASSED
  • Model Path: /mnt/models/nvidia/Gemma-4-31B-IT-NVFP4 
  • Response Check: "Yes, I am working! How can I help you today?..." 
  • Latency: 25.92 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_active_models

● Read(/home/xbill/gemma4-tips/gpu-31B-L4-devops-agent/server.py)
● Search(Grep tool decorators)
● Search(Grep def statements)
● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/get_model_details.json)
● gpu-devops-agent/get_model_details(Get active model details) (ctrl+o to expand)

  I have retrieved the active models and model details using the get_model_details tool.

  ### 🧩 Active Model Details

  • Service Endpoint: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app
  • Health Status ( /health ): Healthy ✅
  • Active Model:
    [
      {
        "id": "/mnt/models/nvidia/Gemma-4-31B-IT-NVFP4",
        "object": "model",
        "owned_by": "vllm"
      }
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
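
&lt;p&gt;The same model listing can be cross-checked without the agent. The sketch below queries the OpenAI-compatible &lt;code&gt;/v1/models&lt;/code&gt; route that vLLM serves and extracts the model ids. The helper names &lt;code&gt;get_json&lt;/code&gt; and &lt;code&gt;list_model_ids&lt;/code&gt; are hypothetical, and the example assumes the endpoint is reachable without extra authentication, which may not hold for a private Cloud Run service. A stub response is used here so the code runs offline.&lt;/p&gt;

```python
import json
import urllib.request


def get_json(url, timeout=10):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def list_model_ids(base_url, fetch=get_json):
    """Return the model ids reported by the /v1/models route."""
    payload = fetch(base_url.rstrip("/") + "/v1/models")
    return [model["id"] for model in payload.get("data", [])]


def fake_fetch(url):
    """Offline stand-in that mimics the shape of a /v1/models response."""
    return {
        "object": "list",
        "data": [{"id": "/mnt/models/nvidia/Gemma-4-31B-IT-NVFP4", "object": "model"}],
    }


print(list_model_ids("https://example.invalid/", fetch=fake_fetch))
```

&lt;p&gt;Passing the fetch function as a parameter keeps the parsing logic testable with mocks, in the same spirit as the mocked test run shown earlier.&lt;/p&gt;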



&lt;p&gt;and the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_endpoint
● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/get_endpoint.json)
● gpu-devops-agent/get_endpoint(Get Cloud Run endpoint) (ctrl+o to expand)

  I have retrieved the active service endpoint using the get_endpoint tool.

  ### 🌐 Cloud Run vLLM Endpoint

  • Status: 🟢 Online
  • Service URL: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_system_details

● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/status_vllm.json)
● Read(/home/xbill/gemma4-tips/gpu-31B-L4-devops-agent/server.py)
● gpu-devops-agent/status_vllm(Call status_vllm tool) (ctrl+o to expand)

  I have retrieved the system status details using the status_vllm tool.

  ### ⚙️ Cloud Run System Details ( gpu-31b-l4-devops-agent )

  • Service URL: https://gpu-31b-l4-devops-agent-wgcq55zbfq-uk.a.run.app
  • Latest Created Revision: gpu-31b-l4-devops-agent-00003-qjn 
  • Conditions:
      • Ready: True (Last Transition: 2026-06-01T01:35:59Z )
      • ConfigurationsReady: True (Last Transition: 2026-06-01T01:21:51Z )
      • RoutesReady: True (Last Transition: 2026-06-01T01:35:59Z )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check the Cloud Run Service
&lt;/h4&gt;

&lt;p&gt;Once deployed, the model is visible as a standard Cloud Run service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed881wrhigpevpjndrf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed881wrhigpevpjndrf6.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Review the Model
&lt;/h4&gt;

&lt;p&gt;Check the model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; what are the model details?

  The active model is running on vLLM with the following details:

  • Model ID / Path: /mnt/models/nvidia/Gemma-4-31B-IT-NVFP4 (This is the Gemma 4 31B quantized model using
  NVFP4 format, optimized for serving on a single NVIDIA L4 GPU).
  • Owned By: vllm 
  • Health Status: Healthy ✅
  • Active Service Endpoint: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once everything is online, you can send queries to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; query_gemma4_with_stats what is 2 + 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Benchmark the Deployed Model
&lt;/h4&gt;

&lt;p&gt;The deployed model was benchmarked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analysis is reviewed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  I have executed the concurrency and context window sweep on the deployed serving stack. The results, detailing
  hardware limits and bottleneck diagnostics, have been compiled.

  Please refer to the following generated files:

  • 📊 Read the complete report: benchmark_report.md
  • 📈 View the raw sweep data: matrix_benchmark_results.csv
  • ⚙️ Inspect the test script: matrix_benchmark.py

  ### Key Decision Point

  To scale to your target configuration of 256 concurrent users and a 16,384 token context window, the single NVIDIA L4
  GPU is physically constrained by VRAM (24GB).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is visualized:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlzucqsmy6p7ub4fs994.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlzucqsmy6p7ub4fs994.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Comparison with the Gemma 4 26B Model
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  The comparison between Gemma 4 (26B) (running fully in VRAM) and Gemma 4 (31B) (running with CPU offloading) on a
  single NVIDIA L4 GPU (24GB VRAM) highlights a massive performance gap.

  Here is the comparison visualization:

  🖼 Image: Gemma 4 26B vs 31B Performance Comparison → /home/xbill/.gemini/antigravity-cli/brain/3230df23-fa96-485f-afb5-4e322b446584/comparison_chart.png

  ### 📊 Performance Metrics Comparison

   Concurrency | 26B Throughput (VRAM) | 31B Throughput (Of… | 26B Latency (… | 31B Latency (Of… | Performance Gap
  -------------|-----------------------|---------------------|----------------|------------------|---------------------
   1 | 39.02 tok/s | 0.31 tok/s | 1.99s | 51.57s | 31B is ~125x slower
   2 | 67.20 tok/s | 0.69 tok/s | 3.43s | 45.55s | 31B is ~97x slower
   4 | 126.08 tok/s | 1.38 tok/s | 3.23s | 45.78s | 31B is ~91x slower
   8 | 115.32 tok/s | 1.38 tok/s | 4.83s | 91.56s | 31B is ~83x slower
   16 | 131.11 tok/s | 1.38 tok/s | 5.95s | 183.13s | 31B is ~95x slower
   32 | 114.58 tok/s | 1.38 tok/s | 9.56s | 366.25s | 31B is ~83x slower
   64 | 119.82 tok/s | 1.38 tok/s | 8.77s | 732.50s | 31B is ~87x slower
   128 | 115.47 tok/s | 1.38 tok/s | 10.40s | 1465.00s | 31B is ~83x slower
   256 | 121.30 tok/s | 1.38 tok/s | 7.62s | 2930.01s | 31B is ~88x slower
  ──────
  ### 🔍 Key Architectural Takeaways

  1. Weight Offload Overhead (PCIe Bottleneck):
      • The Gemma 4 (26B) model quantized to 4-bit occupies ~13GB, fitting comfortably inside the 24GB L4 VRAM without
      offloading. This enables it to run at peak speed, achieving 120+ tokens/second via vLLM's continuous batching.
      • The Gemma 4 (31B) model requires ~16GB. After accounting for runtime memory allocation and KV cache, it
      triggers CPU weight offloading ( --cpu-offload-gb 14 ). Weights must be constantly moved back and forth between
      system RAM and the GPU over the PCIe bus, dropping throughput to ~1.38 tokens/second and increasing base latency
      from 1.99s to 51.57s.
  2. Context and Concurrency Scaling:
      • Gemma 4 (26B) is configured with --max-model-len 4096 and --max-num-seqs 16 , letting it serve context sizes
      up to 4,096 tokens and concurrency up to 16 concurrently without queuing or errors.
      • Gemma 4 (31B) is strictly limited to --max-model-len 512 and --max-num-seqs 4 to prevent VRAM OOM crashes.
      Requests exceeding these bounds either fail instantly with HTTP 400 Bad Request or scale latency linearly to
      ~2,930s under high concurrency due to queuing.

  Scripts used for this analysis:

  • 📊 Comparison Generation: compare_benchmarks.py
  • 📈 Saved comparison plot: comparison_chart.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
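
&lt;p&gt;The figures in that report can be sanity-checked with some quick arithmetic. The sketch below estimates the weights-only footprint at 4 bits per parameter, recomputes one throughput ratio from the table, and models queued latency as fixed-size waves of requests. The wave model is a simplifying assumption for illustration, since vLLM schedules requests continuously, and all three function names are hypothetical.&lt;/p&gt;

```python
import math


def weight_gb(params_billion, bits_per_weight=4):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def slowdown(fast_tok_s, slow_tok_s):
    """Ratio between two throughput figures."""
    return fast_tok_s / slow_tok_s


def queued_latency(concurrency, max_num_seqs, batch_latency_s):
    """Latency of the last request if requests run in fixed-size waves."""
    waves = math.ceil(concurrency / max_num_seqs)
    return waves * batch_latency_s


print(weight_gb(26), weight_gb(31))      # weights-only footprint in GB
print(round(slowdown(39.02, 0.31), 2))   # concurrency 1, 26B vs 31B
print(queued_latency(256, 4, 45.78))     # 31B with 256 concurrent requests
```

&lt;p&gt;The weights come out at 13 GB and 15.5 GB, in line with the report's "~13GB" and "~16GB". With &lt;code&gt;--max-num-seqs 4&lt;/code&gt;, 256 requests form 64 waves of roughly 45.78 seconds each, or about 2,930 seconds, which is close to the measured 2930.01s and shows why latency grows linearly once the queue fills.&lt;/p&gt;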



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu781yegyt99t6027371.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu781yegyt99t6027371.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does all this mean?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  The Gemma 4 (26B) model configuration is indeed fully supported as a distinct, first-class deployment package in the
  workspace under gpu-26B-L4-devops-agent.

  With that context, the 26B configuration works significantly better for single-GPU NVIDIA L4 deployments.

  Here is the breakdown of why it is the superior choice:

  ### 1. 90x - 125x Higher Throughput (VRAM Serving)

  Because the 26B model fits entirely inside the L4's 24GB VRAM, it runs at full speed without needing CPU offloading:

  • 26B: 39.0 - 126.0 tokens/second
  • 31B: 0.3 - 1.38 tokens/second

  ### 2. Sub-Second to Single-Digit Latencies

  • 26B: Average response latency stays between 1.9s and 9.5s across all concurrency levels.
  • 31B: Latency is 45.78s at low load and scales to 2930.01s (50 minutes) under queue pressure.

  ### 3. Full 4,096 Token Context Window

  • 26B: Fully supports a 4,096 token context window (with --max-model-len=4096 ), allowing it to ingest actual system
  log snippets and traces.
  • 31B: Is constrained to --max-model-len=512 , which is too small for most real-world log diagnostics.

  ### Summary Verdict

  For single-GPU (NVIDIA L4) environments, the gpu-26B-L4-devops-agent agent is the better and more viable model,
  delivering high performance and a usable context window. The gpu-31B-L4-devops-agent agent should only be selected
  if you plan to scale the infrastructure to multi-GPU or TPU clusters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment.&lt;/p&gt;

&lt;p&gt;A detailed analysis comparing the 31B and 26B deployments confirmed that, even though the 31B model can be run in the 24GB of GPU memory with various techniques, it becomes unstable. To use the 31B model, more resources are needed.&lt;/p&gt;

</description>
      <category>mcps</category>
      <category>antigravitycli</category>
      <category>googlecloudrun</category>
      <category>gemma4</category>
    </item>
    <item>
      <title>26B Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Mon, 01 Jun 2026 19:42:06 +0000</pubDate>
      <link>https://dev.to/gde/26b-gemma-4-deployment-with-nvidia-l4-mcp-cloud-run-and-antigravity-cli-194p</link>
      <guid>https://dev.to/gde/26b-gemma-4-deployment-with-nvidia-l4-mcp-cloud-run-and-antigravity-cli-194p</guid>
      <description>&lt;p&gt;This article provides a step-by-step guide for deploying Gemma 4 to a GPU-enabled system hosted on Google Cloud Run. A suite of Python MCP tools is built to simplify management of the vLLM-hosted Gemma 4 deployment with Antigravity CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is this project trying to do?
&lt;/h4&gt;

&lt;p&gt;This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.&lt;/p&gt;

&lt;p&gt;This project is similar to a previous project that targeted GPU-hosted Gemma 4 instances on GCP:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/gemma-sre-self-hosted-vllm-infrastructure-agent-afda000c0943" rel="noopener noreferrer"&gt;Gemma-SRE: Self-Hosted vLLM Infrastructure Agent&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;Antigravity CLI is the successor to Gemini CLI, the terminal-driven, agent-assisted coding tool.&lt;/p&gt;

&lt;p&gt;Full details on installing Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing the Antigravity CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools in place, you can test the startup of Antigravity CLI.&lt;/p&gt;

&lt;p&gt;You will need to authenticate with a Google Cloud Project or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python MCP Documentation
&lt;/h4&gt;

&lt;p&gt;The official GitHub Repo provides samples and documentation for getting started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting MCP development for model management is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Antigravity CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.&lt;/p&gt;

&lt;p&gt;This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemma4-tips
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/gpu-26B-L4-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/gpu-26B-L4-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the set_env.sh script can be used to reset the environment if your session times out.&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Management Tool with MCP Stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features that the standard MCP libraries provide is abstracting various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same no matter which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;The simplest transport that the SDK supports is the stdio (stdin/stdout) transport, which connects to a locally running process. Both the MCP client and the MCP server must be running in the same environment.&lt;/p&gt;

&lt;p&gt;The connection over stdio will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
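&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Assumes the official MCP Python SDK (the mcp package)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="c1"&gt;# Initialize FastMCP server
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Self-Hosted vLLM DevOps Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;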
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
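
&lt;p&gt;Under the hood, the stdio transport exchanges JSON-RPC 2.0 messages as newline-delimited lines on the server process's stdin and stdout. The sketch below only illustrates that framing with an in-memory stream. The helper names are hypothetical, and this is not the SDK's implementation; the SDK handles the framing for you.&lt;/p&gt;

```python
import io
import json


def write_message(stream, message):
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    stream.write(json.dumps(message, separators=(",", ":")) + "\n")
    stream.flush()


def read_message(stream):
    """Read one newline-delimited JSON-RPC message, or None at end of stream."""
    line = stream.readline()
    return json.loads(line) if line else None


# In-memory stand-in for the pipe between the MCP client and the server.
pipe = io.StringIO()
write_message(pipe, {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
pipe.seek(0)
print(read_message(pipe))
```

&lt;p&gt;Because each message must fit on one line, anything else the server prints to stdout would corrupt the stream, so server-side logging is normally sent to stderr.&lt;/p&gt;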



&lt;h4&gt;
  
  
  Running the Python Code
&lt;/h4&gt;

&lt;p&gt;First, switch to the directory with the Python version of the MCP sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/gpu-26B-L4-devops-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the release version on the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;install
&lt;/span&gt;Processing ./.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project can also be linted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-26B-L4-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make lint
ruff check &lt;span class="nb"&gt;.&lt;/span&gt;
All checks passed!
ruff format &lt;span class="nt"&gt;--check&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
6 files already formatted
mypy &lt;span class="nb"&gt;.&lt;/span&gt;
Success: no issues found &lt;span class="k"&gt;in &lt;/span&gt;6 &lt;span class="nb"&gt;source &lt;/span&gt;files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a test run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-26B-L4-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make &lt;span class="nb"&gt;test
&lt;/span&gt;python test_agent.py
2026-05-29 19:07:48,275 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 19:07:48,318 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-29 19:07:48,318 - vllm-devops-agent - INFO - Model response: &lt;span class="s1"&gt;'Response from Gemma...'&lt;/span&gt;
.2026-05-29 19:07:48,319 - vllm-devops-agent - INFO - Querying model with stats with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-29 19:07:48,320 - vllm-devops-agent - INFO - Model response with stats: &lt;span class="nv"&gt;TTFT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s, &lt;span class="nv"&gt;TotalTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s
.......
&lt;span class="nt"&gt;----------------------------------------------------------------------&lt;/span&gt;
Ran 17 tests &lt;span class="k"&gt;in &lt;/span&gt;0.024s

OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  MCP stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features that MCP provides is abstracting various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same no matter which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI mcp_config.json
&lt;/h4&gt;

&lt;p&gt;A sample MCP server configuration file (mcp_config.json) is provided in the .agents directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gpu-devops-agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/home/xbill/gemma4-tips/gpu-26B-L4-devops-agent/server.py"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aisprint-491218"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"VLLM_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MODEL_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/mnt/models/gemma-4-26B-A4B-it"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
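
&lt;p&gt;The sample above contains values specific to the author's environment (home directory, project ID, and service URL). As a minimal sketch, the same structure can be regenerated with your own values. The function name and the placeholder values below are hypothetical; only the key names are taken from the sample.&lt;/p&gt;

```python
import json


def build_mcp_config(server_path, project, location, vllm_url, model_name):
    """Return a dict with the same shape as the sample config above."""
    return {
        "mcpServers": {
            "gpu-devops-agent": {
                "command": "python3",
                "args": [server_path],
                "env": {
                    "GOOGLE_CLOUD_PROJECT": project,
                    "GOOGLE_CLOUD_LOCATION": location,
                    "VLLM_BASE_URL": vllm_url,
                    "MODEL_NAME": model_name,
                },
            }
        }
    }


# Placeholder values: substitute your own path, project, region, and service URL.
config = build_mcp_config(
    server_path="/path/to/gemma4-tips/gpu-26B-L4-devops-agent/server.py",
    project="my-project-id",
    location="us-east4",
    vllm_url="https://example-service.run.app",
    model_name="/mnt/models/gemma-4-26B-A4B-it",
)
print(json.dumps(config, indent=2))
```

&lt;p&gt;Redirect the printed JSON into the configuration file once the values have been checked.&lt;/p&gt;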



&lt;h4&gt;
  
  
  Validation with Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;The final connection test uses Antigravity CLI as an MCP client with the Python code providing the MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
&amp;gt; ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                       analyze_cloud_logging, +21 more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Getting Started with Gemma 4 on GPU
&lt;/h4&gt;

&lt;p&gt;The official vLLM repo also has Gemma 4-specific information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm/releases" rel="noopener noreferrer"&gt;Releases · vllm-project/vllm&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Lifecycle Management via MCP
&lt;/h4&gt;

&lt;p&gt;The MCP tools provide a complete suite of agent-oriented operations for managing a vLLM deployment on Cloud Run with GPU.&lt;/p&gt;

&lt;p&gt;Overview of the MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_help

  Here is the help documentation and active environment configuration for the GPU vLLM DevOps Agent:

  ### ⚙️ Active Environment Configuration

  • Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • Region ( GOOGLE_CLOUD_LOCATION ): us-east4 
  • Model Name ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • GCS Bucket ( BUCKET_NAME ): aisprint-491218-bucket 
  • vLLM Endpoint ( VLLM_BASE_URL ): https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app 
  • Active Mode: Running in CLOUD RUN mode targeting NVIDIA L4 GPU in us-east4 .
  ──────
  ### 🧰 Available MCP Tools

  #### 🐳 Infrastructure &amp;amp; Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA L4 in us-east4 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks L4 and other GPU quotas for a region.

  #### 📦 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy a model from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download a model from Hugging Face and upload it to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .

  #### 📊 Monitoring &amp;amp; Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  #### 📈 Performance &amp;amp; Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  #### 💬 Interaction &amp;amp; Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
  ──────
  ### 📂 Key Source Code Reference Files

  • MCP Server Implementation: server.py
  • Project Documentation: README.md
  • Agent Test Suite: test_agent.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploy The Model
&lt;/h4&gt;

&lt;p&gt;The MCP server tools are used to deploy the model to Cloud Run with an NVIDIA L4 GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deploy the model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checking System Status
&lt;/h4&gt;

&lt;p&gt;The status can be checked with an MCP tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_system_status

● gpu-devops-agent/get_system_status(System status check) (ctrl+o to expand)

  The current system status of the GPU Cloud Run deployment:

  • vLLM Health: 🟢 Online (gpu-26b-l4-devops-agent https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app)
  • Cloud Run Service Status: 🟢 Ready

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cross Check The Deployed Model
&lt;/h4&gt;

&lt;p&gt;Once the model starts, the MCP tool allows for model verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; verify_model_health

● gpu-devops-agent/verify_model_health(Model health verification) (ctrl+o to expand)

  The model health check has passed successfully:

  • Status: ✅ PASSED
  • Model: nvidia/gemma-4-26B-A4B-NVFP4 
  • Response: "Yes, I am working correctly. How can I help you to..."
  • Latency: 0.44 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_active_models

● gpu-devops-agent/get_model_details(Get model details) (ctrl+o to expand)

  The active model currently served on the endpoint is:

  • Model ID: nvidia/gemma-4-26B-A4B-NVFP4 
  • Owned By: vllm 
  • Health Status: Healthy (🟢 /health check passed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_endpoint

● gpu-devops-agent/get_endpoint(Get endpoint URL) (ctrl+o to expand)

  The active vLLM endpoint is:

  🟢 Online: https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_system_details

● gpu-devops-agent/get_help(Get SRE help details) (ctrl+o to expand)

  The active system configuration and details for the DevOps agent are as follows:

  ### ⚙️ Environment Configuration

  • GCP Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • GCP Region ( GOOGLE_CLOUD_LOCATION ): us-east4 
  • GCS Bucket ( BUCKET_NAME ): aisprint-491218-bucket 
  • Model Name ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • vLLM Service URL ( VLLM_BASE_URL ): https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app

  ### ℹ️ Active Mode

  The agent is running in CLOUD RUN mode, targeting an NVIDIA L4 GPU in the us-east4 region.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check the Cloud Run Service
&lt;/h4&gt;

&lt;p&gt;Once deployed, the model is visible as a standard Cloud Run service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z23j1zv3je866ufoht4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z23j1zv3je866ufoht4.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Review the Model
&lt;/h4&gt;

&lt;p&gt;Check the model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; what are the model details?

● gpu-devops-agent/get_model_details(Get model details) (ctrl+o to expand)

  The details of the active model are:

  • Model ID: nvidia/gemma-4-26B-A4B-NVFP4 (Gemma 4 26B model quantized/optimized by NVIDIA)
  • Engine: vllm 
  • Health Status: Healthy ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once everything is online, you can send queries to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; query_gemma4_with_stats what is gemma

● gpu-devops-agent/query_gemma4_with_stats(Query Gemma 4 and get metrics) (ctrl+o to expand)

  Here is the response from Gemma 4 along with the streaming performance stats:

  ### 📊 Performance Statistics

  • Model: nvidia/gemma-4-26B-A4B-NVFP4 
  • Time to First Token (TTFT): 0.142s 
  • Total Generation Time: 17.671s 
  • Throughput: 39.71 tokens/s 
  • Total Tokens (approx.): 696 
  ──────
  ### 💬 Model Response

  Gemma is a family of lightweight, state-of-the-art open-weights models developed by Google DeepMind.

  While Gemini is Google's massive, closed-source model served via APIs and web interfaces, Gemma is the developer-
  focused sibling designed to run locally on your own hardware (laptops, workstations, or private cloud servers).

  #### Key Highlights:

  1. Open Weights: You can download the model parameters, inspect them, fine-tune them for specific tasks, and run them
  completely offline.
  2. Gemini DNA: Built using the same research, datasets, and infrastructure as Gemini, enabling high performance in
  reasoning, coding, and math for its size.
  3. Lightweight &amp;amp; Efficient: Designed to run on consumer-grade hardware (GPUs/TPUs) without requiring data-center-
  scale infrastructure.
  4. Variants: Includes specialized flavors like CodeGemma (for programming) and PaliGemma (vision-language).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
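
&lt;p&gt;The reported throughput is consistent with dividing the token count by the generation time after the first token arrives. This is an inference from the numbers above, not a confirmed description of how the tool computes it, and the function name is hypothetical.&lt;/p&gt;

```python
def decode_throughput(total_tokens, total_time_s, ttft_s):
    """Tokens per second over the streaming phase, i.e. after the first token arrives."""
    return total_tokens / (total_time_s - ttft_s)


# Values copied from the performance statistics above.
rate = decode_throughput(total_tokens=696, total_time_s=17.671, ttft_s=0.142)
print(f"{rate:.2f} tokens/s")  # prints "39.71 tokens/s"
```

&lt;p&gt;Dividing by the full 17.671s instead would give about 39.4 tokens/s, so the difference only matters when TTFT is a large share of the total time.&lt;/p&gt;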



&lt;h4&gt;
  
  
  Benchmark the Deployed Model
&lt;/h4&gt;

&lt;p&gt;The deployed model was benchmarked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analysis is reviewed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ### 📊 GPU Benchmark Results

  Model: nvidia/gemma-4-26B-A4B-NVFP4 

   Concurrency | Success Rate | Req/s | Tokens/s | Avg Latency | P95 Latency
  -------------------|-------------------|-------------------|-------------------|------------------|------------------
   1 | 100.0% | 0.50 | 39.02 | 1.99s | 3.29s
   2 | 100.0% | 0.58 | 67.20 | 3.43s | 5.54s
   4 | 100.0% | 1.23 | 126.08 | 3.23s | 4.08s
   8 | 100.0% | 1.28 | 115.32 | 4.83s | 8.01s
   16 | 100.0% | 1.69 | 131.11 | 5.95s | 11.62s
   32 | 100.0% | 1.05 | 114.58 | 9.56s | 19.04s
   64 | 100.0% | 1.24 | 119.82 | 8.77s | 16.11s
   128 | 100.0% | 1.06 | 115.47 | 10.40s | 18.89s
   256 | 100.0% | 1.26 | 121.30 | 7.62s | 15.90s

  Analyzing Benchmark Results
  I have created a comprehensive, deep-dive SRE analysis of the benchmark results and saved it as an artifact.

  You can read the full report here: benchmark_analysis.md

  ### 📈 Key Takeaways from the Analysis

  • Optimal Concurrency Sweet-Spot (16 Users): Peak throughput is achieved at 16 concurrent users yielding 131.11
  Tokens/s with an acceptable average latency of 5.95s.
  • Hardware Saturation Limit: The single NVIDIA L4 GPU saturates at approximately 120–130 Tokens/s for all concurrency
  levels ≥ 4. Adding more concurrent load increases queuing latency rather than throughput.
  • Robust Stability: The deployment maintained a 100% success rate even at 256 concurrent users, demonstrating
  excellent queuing and request-scheduling stability under heavy load.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
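
&lt;p&gt;The takeaways can be checked directly against the raw rows. The sketch below copies the concurrency, tokens-per-second, and average-latency columns from the table above and looks up the peak and the plateau range; the variable names are hypothetical.&lt;/p&gt;

```python
# (concurrency, tokens_per_second, avg_latency_s) rows copied from the table above.
RESULTS = [
    (1, 39.02, 1.99),
    (2, 67.20, 3.43),
    (4, 126.08, 3.23),
    (8, 115.32, 4.83),
    (16, 131.11, 5.95),
    (32, 114.58, 9.56),
    (64, 119.82, 8.77),
    (128, 115.47, 10.40),
    (256, 121.30, 7.62),
]

# Row with the highest aggregate throughput.
peak = max(RESULTS, key=lambda row: row[1])
# Throughput values once the GPU is saturated (concurrency 4 and above).
saturated = [row[1] for row in RESULTS if row[0] >= 4]

print(f"peak: {peak[1]} tokens/s at concurrency {peak[0]}")
print(f"plateau (concurrency 4+): {min(saturated)} to {max(saturated)} tokens/s")
```

&lt;p&gt;On these rows the plateau spans 114.58 to 131.11 tokens/s, slightly wider than the rounded 120–130 range quoted in the analysis.&lt;/p&gt;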



&lt;p&gt;This is visualized:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj1gh7gcgdjo93aounrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj1gh7gcgdjo93aounrc.png" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Comparison with the RTX PRO 6000 GPU
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Here is a comparison between the new benchmark results (from our latest sweep up to 256 concurrency) and the previous
  benchmark results documented in benchmark_report.md:

  ### 📊 Comparative Analysis Table

   Metric | Previous Benchmark Result… | New Benchmark Results (our… | Key Differences &amp;amp; SRE Obse…
  ----------------------------|----------------------------|-----------------------------|-----------------------------
   Peak Throughput | 80 to 90 tokens/second (at | 131.11 tokens/second (at 16 | ~45% increase in peak       
                              | smaller context windows) | concurrency) | throughput. This suggests
                              | | | the vLLM engine is running
                              | | | with higher batch sizes or
                              | | | optimized FP8 KV cache
                              | | | utilization compared to the
                              | | | previous run.
   Average Latency (Low | Sub-second (&amp;lt; 1.0s) | 1.99s average (at 1 | Higher base latency in the  
   Concurrency) | average | concurrency) | new run. The new test
                              | | | generated a fixed 128 
                              | | | token output length,
                              | | | whereas the previous
                              | | | benchmark might have had a
                              | | | much smaller target output
                              | | | token length.
   Average Latency (High | Under 1.0s average at 128 | 10.40s average at 128 | Significantly higher high-
   Concurrency) | concurrency | concurrency | load latency. At 128
                              | | | concurrent requests
                              | | | generating 128 tokens each,
                              | | | the GPU queue fills up,
                              | | | increasing average response
                              | | | time to 10.4 seconds.
   Maximum Concurrency Tested | Up to 128 users | Up to 256 users | Successfully tested higher
                              | | | load. The engine handled
                              | | | 256 users without error
                              | | | (100% success rate),
                              | | | maintaining a throughput of
                              | | | 121.30 tokens/s.

  ### 🔍 SRE Takeaway

  The newer serving configuration on the NVIDIA L4 is tuned to maximize aggregate throughput (yielding up to 131.11    
  tokens/s at high concurrency), but this comes at the cost of higher latency per request under concurrent load due to
  the increased execution and queue time for generating up to 128 tokens.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cost per Token
&lt;/h4&gt;

&lt;p&gt;The current Google Cloud pricing is used to evaluate the cost per token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Estimating Compute Costs
  To compare the cost efficiency (cost per million tokens) between the NVIDIA L4 and the NVIDIA RTX PRO 6000           
  (Blackwell) on Google Cloud Run, we can construct an SRE cost model.

  GPU billing on Cloud Run is billed per second, but requires minimum allocations of vCPUs and memory.
  ──────
  ### 1. Hourly Resource Cost Estimates (On-Demand)

   Resource component | NVIDIA L4 Stack | NVIDIA RTX PRO 6000 Stack
  ---------------------------------|-----------------------------------------|-----------------------------------------
   GPU Rate | ~$0.55 / hour | ~$2.50 / hour
   Required vCPUs | 4 vCPU (Min) → ~$0.184 / hr | 20 vCPU (Min) → ~$0.92 / hr
   Required Memory | 16 GiB (Min) → ~$0.08 / hr | 80 GiB (Min) → ~$0.40 / hr
   Total Estimated Stack Cost | ~$0.81 / hour ($0.000226/sec) | ~$3.82 / hour ($0.001061/sec)
  ──────
  ### 2. Cost Per Token Comparison (At Peak Load)

  Assuming the L4 runs at the peak performance observed in our sweep (~131 Tokens/s) and estimating the RTX PRO 6000
  Blackwell to run at ~2.5x the throughput (~327 Tokens/s) due to its larger VRAM bandwidth and compute:

   Metric | NVIDIA L4 | NVIDIA RTX PRO 6000
  ---------------------------------------|--------------------------------------|--------------------------------------
   Peak Throughput | 131.11 Tokens/s | ~327.50 Tokens/s (Estimated)
   Hourly Stack Cost | $0.81 / hour | $3.82 / hour
   Cost per Second | $0.000226 | $0.001061
   Cost per Million Tokens | ~$1.72 | ~$3.24
  ──────
  ### 🔍 Cost Analysis &amp;amp; Recommendation

  1. The L4 is the Budget-Friendly Champion:
      • For standard batch sizes and latency tolerance, the NVIDIA L4 is roughly ~47% more cost-effective per token.
      This is because the RTX 6000 requires a mandatory minimum allocation of 20 vCPUs and 80 GB RAM, which
      significantly inflates the baseline stack cost.
  2. When to Choose the RTX PRO 6000:
      • Use the RTX PRO 6000 if you require its 96 GB VRAM capacity for running unquantized/larger model variants (like
      the full Gemma 4 31B/72B) or need extremely low latency/high throughput on very large sequence context tasks
      (≥ 32k tokens).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
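
&lt;p&gt;The arithmetic behind the two tables can be checked with a short script. This is a minimal sketch that reuses the approximate rates quoted above (not live pricing) and the estimated RTX PRO 6000 throughput; the function names are illustrative.&lt;/p&gt;

```python
# Reproduce the cost-per-million-token figures from the tables above.
# Rates are the approximate values quoted in the article, not live pricing.

def hourly_stack_cost(gpu, vcpu, memory):
    """Sum the hourly GPU, vCPU, and memory rates (USD per hour)."""
    return gpu + vcpu + memory


def cost_per_million_tokens(hourly_cost, tokens_per_second):
    """USD needed to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


l4_hourly = hourly_stack_cost(0.55, 0.184, 0.08)
rtx_hourly = hourly_stack_cost(2.50, 0.92, 0.40)

l4_cost = cost_per_million_tokens(l4_hourly, 131.11)
rtx_cost = cost_per_million_tokens(rtx_hourly, 327.50)  # estimated throughput

print(f"L4:           ${l4_hourly:.2f}/hour, ${l4_cost:.2f} per million tokens")
print(f"RTX PRO 6000: ${rtx_hourly:.2f}/hour, ${rtx_cost:.2f} per million tokens")
print(f"L4 saving per token: {(1 - l4_cost / rtx_cost) * 100:.0f}%")
```

&lt;p&gt;Because the RTX PRO 6000 throughput is an estimate rather than a measurement, its cost per million tokens should be read as a projection.&lt;/p&gt;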



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot Cloud Run model deployments.&lt;/p&gt;

</description>
      <category>antigravitycli</category>
      <category>mcps</category>
      <category>nvidial4</category>
      <category>gemma4</category>
    </item>
    <item>
      <title>Local Mac Gemma 4 Deployment with MCP and Antigravity CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Sun, 31 May 2026 19:03:09 +0000</pubDate>
      <link>https://dev.to/gde/local-mac-gemma-4-deployment-with-mcp-and-antigravity-cli-l0p</link>
      <guid>https://dev.to/gde/local-mac-gemma-4-deployment-with-mcp-and-antigravity-cli-l0p</guid>
      <description>&lt;p&gt;This article provides a step-by-step guide to deploying Gemma 4 on an M3 MacBook Air. A suite of Python MCP tools is built to simplify management of the Ollama-hosted Gemma 4 deployment with Antigravity CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo01u063f16n67y0xpzt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo01u063f16n67y0xpzt.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is This Project Trying to Do?
&lt;/h4&gt;

&lt;p&gt;This project is a DevOps/SRE assistant that uses a Gemma 4 model self-hosted locally. It provides tools to deploy the model, as well as for observability and performance testing.&lt;/p&gt;

&lt;p&gt;This project is similar to a previous project that targeted TPU-hosted Gemma 4 instances on GCP:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/self-hosted-gemma-4-on-tpu-with-mcp-adk-and-gemini-cli-7f646458a3c3" rel="noopener noreferrer"&gt;Self-hosted Gemma 4 on TPU with vLLM, MCP, ADK, and Gemini CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;Antigravity CLI is the successor to Gemini CLI, the terminal-driven, agent-assisted coding tool.&lt;/p&gt;

&lt;p&gt;Full details on installing Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing the Antigravity CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once all the tools are in place, you can test the startup of Antigravity CLI.&lt;/p&gt;

&lt;p&gt;You will need to authenticate with a Google Cloud Project or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python MCP Documentation
&lt;/h4&gt;

&lt;p&gt;The official GitHub repository provides samples and documentation for getting started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting MCP development for model management is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Antigravity CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.&lt;/p&gt;

&lt;p&gt;This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemma4-tips
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/mac-2B-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/mac-2B-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID must be set for the various build scripts, so the &lt;strong&gt;set_env.sh&lt;/strong&gt; script can be used to restore the environment after a timeout.&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Management Tool with MCP Stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features that the standard MCP libraries provide is abstracting various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same no matter which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;The simplest transport that the SDK supports is stdio (stdin/stdout), which connects the client to a locally running server process. Both the MCP client and the MCP server must run in the same environment.&lt;/p&gt;

&lt;p&gt;The connection over stdio will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
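&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize FastMCP server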
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Self-Hosted vLLM DevOps Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
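
&lt;p&gt;Under the SDK, a stdio transport amounts to the client launching the server as a child process and exchanging newline-delimited JSON-RPC messages over its stdin and stdout. The sketch below illustrates only that framing with the standard library. It is not the SDK's implementation: the serve helper is hypothetical, and the MCP initialization handshake, tool calls, and notifications are left out.&lt;/p&gt;

```python
import io
import json


def serve(stdin, stdout):
    """Answer each newline-delimited JSON-RPC request read from stdin."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        if request.get("method") == "ping":
            reply = {"jsonrpc": "2.0", "id": request["id"], "result": {}}
        else:
            reply = {
                "jsonrpc": "2.0",
                "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"},
            }
        # One JSON document per line; flush so the client sees it immediately.
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()


# Simulate a client by feeding one request through in-memory streams.
client_to_server = io.StringIO('{"jsonrpc": "2.0", "id": 1, "method": "ping"}\n')
server_to_client = io.StringIO()
serve(client_to_server, server_to_client)
print(server_to_client.getvalue().strip())
```

&lt;p&gt;In a real server the same function would be called with sys.stdin and sys.stdout, which is why anything else the process prints to stdout would corrupt the message stream.&lt;/p&gt;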



&lt;h4&gt;
  
  
  Running the Python Code
&lt;/h4&gt;

&lt;p&gt;First, switch to the directory containing the Python version of the MCP sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/gemma4-tips/mac-2B-devops-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the release version on the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;install&lt;/span&gt;
&lt;span class="err"&gt;Processing&lt;/span&gt; &lt;span class="err"&gt;./.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project can also be linted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;m3:mac-2B-devops-agent xbill$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;make lint
&lt;span class="go"&gt;ruff check .
All checks passed!
ruff format --check .
10 files already formatted
mypy .
Success: no issues found in 10 source files
&lt;/span&gt;&lt;span class="gp"&gt;m3:mac-2B-devops-agent xbill$&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Antigravity CLI mcp_config.json
&lt;/h4&gt;

&lt;p&gt;A sample MCP server configuration file is provided in the .agents directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"local-devops-agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/Users/xbill/gemma4-tips/mac-2B-devops-agent/server.py"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aisprint-491218"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MODEL_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google/gemma-4-E2B-it"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"LOCAL_VLLM_PORT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8000"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
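
&lt;p&gt;Before starting the CLI, it can help to confirm that the configuration file is strict JSON, since standard JSON parsers reject details such as trailing commas. The sketch below is one hedged way to do that with the standard library. The check_mcp_config helper is hypothetical, the sample path is a placeholder, and the only structural assumption is the mcpServers layout shown above.&lt;/p&gt;

```python
import json

# Trimmed copy of the sample configuration; the path is a placeholder.
SAMPLE_CONFIG = """
{
  "mcpServers": {
    "local-devops-agent": {
      "command": "python3",
      "args": ["/path/to/mac-2B-devops-agent/server.py"],
      "env": {"LOCAL_VLLM_PORT": "8000"}
    }
  }
}
"""


def check_mcp_config(text):
    """Parse the config and return a list of problems (empty means none found)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    problems = []
    for name, server in config.get("mcpServers", {}).items():
        if not server.get("command"):
            problems.append(f"{name}: missing 'command'")
    return problems


print(check_mcp_config(SAMPLE_CONFIG) or "config looks OK")
```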



&lt;h4&gt;
  
  
  Validation with Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;The final connection test uses Antigravity CLI as an MCP client with the Python code providing the MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m3:mac-2B-devops-agent xbill$ agy

                  Antigravity CLI 1.0.3
                  xbill9@gmail.com (Google AI Ultra)
                  Gemini 3.5 Flash (Low)
                  ~/gemma4-tips/mac-2B-devops-agent

────────────────────────────────────────────────────────────────────────────────
&amp;gt; 
────────────────────────────────────────────────────────────────────────────────
MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
&amp;gt; ✓ local-devops-agent Tools: verify_model_health, save_hf_token,
                         manage_docker, get_system_status, get_endpoint, +9 more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Getting Started with Gemma 4 Locally
&lt;/h4&gt;

&lt;p&gt;Ollama automatically uses your M3 Mac’s unified memory and Metal GPU acceleration out of the box. It runs local open-weight models (like Qwen, Llama, and Gemma) at high speed without the capacity limits of discrete VRAM. [&lt;a href="https://www.reddit.com/r/ollama/comments/1echlso/ollama_llamacpp_and_macbooks/" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://medium.com/@michael.hannecke/running-llms-locally-on-apple-silicon-a-practical-guide-for-developers-980deed326d9" rel="noopener noreferrer"&gt;2&lt;/a&gt;, &lt;a href="https://www.sitepoint.com/mac-m3-max-vs-rtx-4090-local-llm-performance-showdown-2026/" rel="noopener noreferrer"&gt;3&lt;/a&gt;, &lt;a href="https://medium.com/@borislavbankov/the-engine-beneath-ollama-was-wrong-for-your-mac-that-just-changed-0ff28116ac7f" rel="noopener noreferrer"&gt;4&lt;/a&gt;, &lt;a href="https://www.spheron.network/blog/run-llms-locally-ollama/" rel="noopener noreferrer"&gt;5&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;How to Run Ollama with Metal:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Ollama&lt;br&gt;
Download the macOS installer directly from &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, or open your Terminal and run: &lt;code&gt;curl -fsSL &lt;a href="https://ollama.com/install.sh" rel="noopener noreferrer"&gt;https://ollama.com/install.sh&lt;/a&gt; | sh&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pull a Model&lt;br&gt;
Choose a model from the &lt;a href="https://ollama.com/library" rel="noopener noreferrer"&gt;Ollama Library&lt;/a&gt; and pull it in your terminal:&lt;br&gt;
&lt;code&gt;ollama pull qwen3:8b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Run the Model&lt;br&gt;
Start the interactive prompt:&lt;br&gt;
&lt;code&gt;ollama run qwen3:8b&lt;/code&gt; [&lt;a href="https://www.youtube.com/watch?v=ucxtk-jUj6c" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://medium.com/@dzbik.konrad/ollama-quick-start-for-macos-eb6de56e9cba" rel="noopener noreferrer"&gt;2&lt;/a&gt;, &lt;a href="https://www.mindstudio.ai/blog/run-open-weight-ai-models-locally-ollama-lm-studio" rel="noopener noreferrer"&gt;3&lt;/a&gt;, &lt;a href="https://docs.unstract.com/unstract/unstract_platform/adapters/llms/ollama_llm/" rel="noopener noreferrer"&gt;4&lt;/a&gt;, &lt;a href="https://www.reddit.com/r/ollama/comments/1srz7ci/ollama_in_m5_macbook_pro_issue_metal/" rel="noopener noreferrer"&gt;5&lt;/a&gt;]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;M3 Hardware Tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model Size: The model size you can run depends heavily on your physical unified memory (RAM). For example, a 7B or 8B model takes about 4 GB of RAM, while a 70B model requires about 40 GB. [&lt;a href="https://www.sitepoint.com/local-llms-apple-silicon-mac-2026/" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://www.sitepoint.com/mac-m3-max-vs-rtx-4090-local-llm-performance-showdown-2026/" rel="noopener noreferrer"&gt;2&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;Unified Memory: Because your M3 CPU and GPU share the same memory pool, large models can be loaded entirely. Quit other memory-heavy applications to maximize token generation speed. [&lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/17sgwb8/m3_mac_performance_improvements/" rel="noopener noreferrer"&gt;1&lt;/a&gt;, &lt;a href="https://www.sitepoint.com/mac-m3-max-vs-rtx-4090-local-llm-performance-showdown-2026/" rel="noopener noreferrer"&gt;2&lt;/a&gt;, &lt;a href="https://www.reddit.com/r/ollama/comments/1pni9e4/can_i_try_ollama_with_a_macbook_air_m3/" rel="noopener noreferrer"&gt;3&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;Avoid Docker: Do not run Ollama inside a Docker container on macOS, as GPU passthrough and Metal acceleration are not natively supported there.&lt;/li&gt;
&lt;/ul&gt;
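
&lt;p&gt;The memory figures in the tips above are roughly what 4-bit quantized weights would occupy. The sketch below makes that back-of-the-envelope estimate explicit. The 4-bit default is an assumption rather than something the sources state, and the estimate covers weights only, so KV cache and runtime overhead come on top (which may explain why 70B is quoted closer to 40 GB).&lt;/p&gt;

```python
def weight_memory_gb(params_billions, bits_per_weight=4):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight


for size in (7, 8, 70):
    print(f"{size}B model at 4-bit: about {weight_memory_gb(size):.1f} GB of weights")
```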

&lt;h4&gt;
  
  
  Model Lifecycle Management via MCP
&lt;/h4&gt;

&lt;p&gt;The MCP tools provide a complete suite of agent-oriented operations for managing the local Ollama-hosted Gemma 4 deployment.&lt;/p&gt;

&lt;p&gt;Overview of the MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; print out mcp tools

  Here is the updated list of available MCP tools for the local Gemma 4 agent:

   Tool Name | Description | Code Location
  ---------------------------|-----------------------------------------------------------------------|-----------------
    analyze_local_logs | Analyzes the local container logs using Gemma 4 to find SRE/DevOps | server.py
                             | errors. |
    get_active_models | Gets the currently loaded models in Ollama's memory. | server.py
    get_docker_logs | Retrieves startup and execution logs from the Docker container. | server.py
    get_endpoint | Checks endpoint connectivity and returns the service URL. | server.py
    get_help | Provides help text and summarizes the configuration options. | server.py
    get_model_show_details | Gets parameters, architecture, license, and config for a model. | server.py
    get_system_details | Retrieves detailed information about the running local model, engine, | server.py
                             | and versions. |
    get_system_status | Displays a dashboard of container and serving health status. | server.py
    manage_docker | Starts, stops, restarts, or removes the container. | server.py
    query_gemma4 | Standard prompt query interface for the local model. | server.py
    query_gemma4_with_stats | Prompt interface returning performance metrics (latency, TTFT). | server.py
    run_benchmark | Runs the benchmark suite or vLLM benchmarking script. | server.py
    save_hf_token | Securely caches a Hugging Face API token. | server.py
    verify_model_health | Runs a deep health check verifying latency and response correctness. | server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checking System Status
&lt;/h4&gt;

&lt;p&gt;The status can be checked with an MCP tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_system_status

● local-devops-agent/get_system_status(Get system status) (ctrl+o to expand)

  ### 🌀 Local System Status

  • vLLM Health: 🟢 Online (http://localhost:8000)
  • Native Ollama Service Status: 🟢 Running (ollama daemon is active via
  Homebrew launch agent)

  Next Steps: You can now interact with the local model using the            
  query_gemma4 tool.

──────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploy The Model Locally
&lt;/h4&gt;

&lt;p&gt;Once the base environment has been checked, the Gemma model can be deployed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; deploy gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cross Check The Deployed Model
&lt;/h4&gt;

&lt;p&gt;Once the model starts locally, the MCP tools can verify it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; model health check
  • Status: ✅ Model health check PASSED.
  • Model Response: "Hello! Yes, I am working and ready to assist you..."
  • Inference Latency: 4.84 seconds.

  The initial empty response was likely due to a transient cold-start or model
  loading delay (the model was active in CPU/GPU memory). The local
  gemma4:e2b model is now fully healthy and ready for queries.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_active_models

● local-devops-agent/get_active_models(Get active models) (ctrl+o to expand)

  ### 📊 Active Loaded Models:

    NAME         ID             SIZE     PROCESSOR         CONTEXT   UNTIL
    gemma4:e2b   bcec27ff2bea   7.7 GB   76%/24% CPU/GPU   2048      4 minutes from now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_endpoint

● local-devops-agent/get_endpoint(Get endpoint) (ctrl+o to expand)

  🟢 Local vLLM/Ollama is Online at: http://localhost:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;get_system_details&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;●&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;local-devops-agent/get_system_details(Get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;system&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;details)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(ctrl+o&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;expand)&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;###&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;🧩&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Details&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(http://localhost:&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;Model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Information&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;/models&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;):&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"list"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemma4:e2b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1780234616&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"owned_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"library"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemma4:e4b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1780234234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"owned_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"library"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;•&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Health&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Status:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Healthy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;✅&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
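
&lt;p&gt;The same information can be read directly from the endpoint without going through the agent. The sketch below assumes the local server exposes the OpenAI-style /v1/models route on port 8000, as the transcript above suggests. Both helper names are hypothetical, and the offline sample simply mirrors the response shown above.&lt;/p&gt;

```python
import json
import urllib.request


def list_model_ids(payload):
    """Return the model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url="http://localhost:8000"):
    """Query the local endpoint; this needs the server to be running."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as response:
        return json.load(response)


# Offline example using the response shown above.
sample = {
    "object": "list",
    "data": [
        {"id": "gemma4:e2b", "object": "model", "created": 1780234616, "owned_by": "library"},
        {"id": "gemma4:e4b", "object": "model", "created": 1780234234, "owned_by": "library"},
    ],
}
print(list_model_ids(sample))
```

&lt;p&gt;With the server up, list_model_ids(fetch_models()) would return the live list instead of the sample.&lt;/p&gt;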



&lt;h4&gt;
  
  
  Review the Model
&lt;/h4&gt;

&lt;p&gt;Check the model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Here are the deep parameters and configurations for the loaded model
● local-devops-agent/get_model_show_details(Get model show ...) (ctrl+o to
expand)

  Here are the deep parameters and configurations for the loaded model (     
  gemma4:e2b ):

  ### 🧩 Model Details for gemma4:e2b 

  • Architecture: gemma4 
  • Parameters: 5.1B 
  • Context Length: 131,072 (Configured to 1024 for run)
  • Embedding Length: 1,536 
  • Quantization: Q4_K_M 
  • Capabilities: Completion, Vision, Audio, Tools, Thinking
  • Parameters Config:
      • temperature : 1 
      • top_k : 64 
      • top_p : 0.95 
  • License: Apache License 2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once everything is online, you can send queries to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; query_gemma4_with_stats what is gemma?

● local-devops-agent/query_gemma4_with_stats(Query Gemma 4 ...) (ctrl+o to
expand)

  ### 📊 Inference Performance Stats

  • Time to First Token (TTFT): 10.912s 
  • Total Generation Time: 28.396s 
  • Throughput: 14.93 tokens/s 
  • Total Tokens (approx.): 261 
  ──────
  ### 💬 Model Response

  "Gemma" primarily refers to a family of lightweight, open-weights large    
  language models (LLMs) developed by Google DeepMind.

  Here is a breakdown of what that means:

  1. Large Language Models (LLMs): Like other LLMs, Gemma models are trained
  on massive amounts of text data, allowing them to understand, generate, and
  process human language.
  2. Open Weights: A key feature of the Gemma models is that they are
  released with open weights. This means researchers, developers, and the
  public have access to the model parameters, allowing them to inspect,
  modify, fine-tune, and build upon the models for specific applications.
  3. Lightweight and Efficient: The models are designed to be efficient,
  meaning they offer strong performance while being relatively lightweight
  compared to some of the largest models. This makes them accessible for
  deployment on a wider range of hardware.
  4. Family: The Gemma family includes various sizes (e.g., different
  parameter counts) to suit different computational needs, ranging from
  smaller, highly efficient models to more capable ones.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
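
&lt;p&gt;The reported throughput is easier to interpret next to the other numbers. Judging from the figures alone, it matches tokens divided by the time spent after the first token, not by the total time. That reading is an inference from this one run, not something the tool documents. A quick check:&lt;/p&gt;

```python
ttft_s = 10.912      # time to first token, from the stats above
total_s = 28.396     # total generation time
total_tokens = 261   # approximate token count reported by the tool

decode_s = total_s - ttft_s
decode_tps = total_tokens / decode_s
end_to_end_tps = total_tokens / total_s

print(f"Decode-only throughput: {decode_tps:.2f} tokens/s")
print(f"End-to-end throughput:  {end_to_end_tps:.2f} tokens/s")
```

&lt;p&gt;The decode-only figure agrees with the reported 14.93 tokens/s, while the end-to-end rate, which includes the long wait for the first token, is noticeably lower.&lt;/p&gt;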



&lt;h4&gt;
  
  
  Benchmark the Local Model
&lt;/h4&gt;

&lt;p&gt;The local model was benchmarked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;● Create(/Users/xbill/.gemini/an...benchmark_comparison.md) (ctrl+o to expand)

  I have compiled a comparison report between the two benchmark results:

  • 4B Agent File: benchmark_results.csv
  • 2B Agent File: benchmark_results.csv

  The full analysis is saved in the artifact: benchmark_comparison.md.

  ### Summary Comparison

   Metric | Concurrency | 4B Model (m… | 2B Model (ma… | Speedup / Di…
  --------------|-------------|--------------|---------------|---------------
   Avg Latency | 1 | 1.57s | 0.59s | 2.66x faster
   (s) | | | | (2B)
                | 2 | 2.68s | 0.79s | 3.39x faster
                | | | | (2B)
   P95 Latency | 1 | 1.57s | 0.61s | 2.57x faster
   (s) | | | | (2B)
                | 2 | 3.02s | 1.01s | 2.99x faster
                | | | | (2B)
   Req/sec | 1 | 0.64 | 1.70 | +166% (2B)
                | 2 | 0.67 | 1.98 | +195% (2B)
   Tokens/sec | 1 | 20.44 | 27.25 | +33.3% (2B)
                | 2 | 21.57 | 31.70 | +47.0% (2B)

  ### Key Takeaways

  1. Performance: The 2B model is significantly faster across the board,
  processing requests with roughly 3x lower latency and achieving 47% more   
  tokens per second at concurrency 2.
  2. Scalability: While the 4B model shows a sharp latency increase from
  concurrency 1 to 2, the 2B model handles concurrency scaling up to 16 with
  very little performance degradation (latency remaining under ~0.83s).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is visualized:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3vghq32bdwias0u3rjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3vghq32bdwias0u3rjk.png" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The strategy of using MCP for local Gemma 4 deployment with Antigravity CLI was validated with an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools needed to deploy and troubleshoot local model deployments.&lt;/p&gt;

</description>
      <category>mcpserver</category>
      <category>llm</category>
    </item>
    <item>
      <title>Mix and Match: Running Kiro on Google Cloud Shell</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Sun, 31 May 2026 12:12:24 +0000</pubDate>
      <link>https://dev.to/gde/mix-and-match-running-kiro-on-google-cloud-shell-5hco</link>
      <guid>https://dev.to/gde/mix-and-match-running-kiro-on-google-cloud-shell-5hco</guid>
      <description>&lt;p&gt;This article provides a step-by-step deployment guide for using the Amazon Kiro CLI on Google Cloud Shell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dr9bpso8u286rs07ys8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dr9bpso8u286rs07ys8.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is this project trying to Do?
&lt;/h4&gt;

&lt;p&gt;This project aims to configure the Kiro CLI for seamless cross-cloud operations across AWS and Google Cloud.&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t You Running on the Wrong Cloud?
&lt;/h4&gt;

&lt;p&gt;Mix and Match — Baby! The Antigravity CLI runs on AWS Cloud Shell:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsaws.com/aws-cloud-shell-with-antigravity-cli-50f6d44f85a1" rel="noopener noreferrer"&gt;AWS Cloud Shell with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So let’s give Kiro a shot on Google Cloud Shell!&lt;/p&gt;

&lt;h4&gt;
  
  
  What is the Google Cloud Shell Terminal and Editor?
&lt;/h4&gt;

&lt;p&gt;The Google Cloud Shell, available on every Google Cloud project, comes with a built-in code editor that lets you browse file directories and view and edit files, with continued access to the Cloud Shell. The Cloud Shell Editor is available by default with every Cloud Shell instance, and is based on &lt;a href="https://github.com/microsoft/vscode" rel="noopener noreferrer"&gt;Code OSS&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where Can I find the Cloud Shell?
&lt;/h4&gt;

&lt;p&gt;The Cloud Shell is provided on the Google Cloud Console for all Google Cloud Projects. It is available from the small icon on the upper right next to the Gemini Icon:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzfm14v90txv4ib8oefz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzfm14v90txv4ib8oefz.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Cloud Shell Editor is a basic hosted VS Code environment for rapid prototyping. It offers a minimal IDE to get started. For more in-depth development, other options are available, such as Firebase Studio or developing from a GCE VM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gghl8sjnuvnk22q5gih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gghl8sjnuvnk22q5gih.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS Setup
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides the basic tools for working with the AWS services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/cli/" rel="noopener noreferrer"&gt;AWS CLI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check the AWS installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill9@cloudshell:~ (aisprint-491218)$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/usr/local/bin/aws &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;aws-cli/2.34.57 Python/3.14.5 Linux/6.6.137+ exe/x86_64.ubuntu.24
&lt;/span&gt;&lt;span class="gp"&gt;xbill9@cloudshell:~ (aisprint-491218)$&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the tools are installed, log in to AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What is Kiro
&lt;/h4&gt;

&lt;p&gt;Kiro is an agentic, AI-powered Integrated Development Environment (IDE) built by Amazon Web Services (AWS) designed to help developers move from initial concepts to production-ready software. Based on a fork of VS Code (Code OSS), Kiro feels familiar while acting like an autonomous co-developer. [&lt;a href="https://dev.to/aws-builders/introducing-kiro-an-ai-ide-that-thinks-like-a-developer-42jp"&gt;1&lt;/a&gt;, &lt;a href="https://caylent.com/blog/kiro-first-impressions" rel="noopener noreferrer"&gt;2&lt;/a&gt;, &lt;a href="https://www.constellationr.com/insights/news/aws-launches-kiro-ide-powered-ai-agents" rel="noopener noreferrer"&gt;3&lt;/a&gt;]&lt;/p&gt;

&lt;h4&gt;
  
  
  Installing Kiro
&lt;/h4&gt;

&lt;p&gt;The main download site for Kiro is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://kiro.dev/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://kiro.dev/" rel="noopener noreferrer"&gt;Kiro: Bring engineering rigor to agentic development&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jto4o59z59l417plw06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jto4o59z59l417plw06.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the CLI download, click CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://cli.kiro.dev/install | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download Kiro to Google Cloud Shell:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fctjmm3q5gutsn0of0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fctjmm3q5gutsn0of0x.png" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running Kiro
&lt;/h4&gt;

&lt;p&gt;First — authenticate to Kiro:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit6wgp2n7r271m9pm36y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit6wgp2n7r271m9pm36y.png" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then confirm:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq4154u1h620exjsg9pb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq4154u1h620exjsg9pb.png" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Verify The Installation
&lt;/h4&gt;

&lt;p&gt;Kiro has full access to the local Google Cloud Shell Environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gmb4cv0oant9u3dzlr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gmb4cv0oant9u3dzlr8.png" width="800" height="364"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; summarize my google cloud project

● Shell gcloud config list 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo "---" &amp;amp;&amp;amp; gcloud projects describe $(gcloud config get-value project
  2&amp;gt;/dev/null) 2&amp;gt;/dev/null
● Shell echo "=== Enabled APIs ===" &amp;amp;&amp;amp; gcloud services list --enabled --format="value(name)" 2&amp;gt;/dev/null | sed
  's/.googleapis.com//' &amp;amp;&amp;amp; echo "=== Compute Instances ===" &amp;amp;&amp;amp; gcloud compute instances list 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo "=== Cloud
  Run Services ===" &amp;amp;&amp;amp; gcloud run services list 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo "=== GCS Buckets ===" &amp;amp;&amp;amp; gsutil ls 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo
  "=== Cloud Functions ===" &amp;amp;&amp;amp; gcloud functions list 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo "=== Billing ===" &amp;amp;&amp;amp; gcloud billing projects
  describe aisprint-491218 2&amp;gt;/dev/null
    === Enabled APIs ===
    projects/289270257791/services/addressvalidation
    projects/289270257791/services/aerialview
    projects/289270257791/services/agentregistry
    projects/289270257791/services/aiplatform
    ...+167 lines (ctrl+o to toggle)

  Here's a summary of your Google Cloud project:

  Project

  - Name: aisprint (ID: aisprint-491218)
  - Account: xbill9@gmail.com
  - Created: March 24, 2026 — Active

  Cloud Run Services (6 total, all deployed today/yesterday)

  ┌───────────────────────────┬─────────────┬─────────────┐
  │ Service │ Region │ Status │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-4b-6000-devops-agent │ us-central1 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-4b-l4-devops-agent │ us-east4 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-26b-6000-devops-agent │ us-central1 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-26b-l4-devops-agent │ us-east4 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-31b-6000-devops-agent │ us-central1 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-31b-l4-devops-agent │ us-east4 │ ❌ (failed) │
  └───────────────────────────┴─────────────┴─────────────┘

  The naming pattern suggests GPU-accelerated devops agents in two GPU tiers (NVIDIA L4 and RTX 6000) across three model
  sizes (4B, 26B, 31B parameters).

  Storage

  - aisprint-491218-bucket
  - aisprint-491218-source-bucket
  - aisprint-491218_cloudbuild

  No active Compute instances or Cloud Functions.

  Enabled APIs — heavily focused on AI/ML:

  - AI Platform, Vertex AI (aiplatform), Generative Language, Gemini, ML
  - Maps Platform (full suite), BigQuery, Cloud Run, Cloud Build
  - Secret Manager, Pub/Sub, Dataflow, GKE, Artifact Registry
  - Google Workspace APIs (Gmail, Drive, Calendar, Chat) + MCP variants

  The project is clearly an AI agent platform, running GPU-backed LLM inference services on Cloud Run with GPU support.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Test The AWS Integration
&lt;/h4&gt;

&lt;p&gt;This step tests the Kiro interactions with the remote AWS environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Compute

  - 1 EC2 instance: i-03456018d00d13504 (t3.small, stopped)
  - No active ECS clusters or EKS clusters running (stacks exist, clusters likely deleted)

  Lambda

  - AgentCore-mcpBedrockRustA-ApplicationAgentMcpServe-rWGgAcNdRUnH

  S3 Buckets (5)

  - bedrock-agentcore-runtime-... (us-west-2)
  - cdk-hnb659fds-assets-... (CDK bootstrap)
  - elasticbeanstalk-us-east-1-...
  - my-adk-comic-artifacts
  - xbill-test

  CloudFormation Stacks (13 active)

  - Multiple AgentCore-* stacks (Bedrock, ADK, MCP variants)
  - eksctl-* stacks for 3 EKS clusters (mcp-eks, adkui-eks, biometric-scout) — infrastructure stacks remain even if clusters
  are torn down
  - CDKToolkit bootstrap

  ECR Repositories (33)

  - Heavy focus on AI agent workloads: course creator pipelines (orchestrator, researcher, content-builder, judge), MCP
  servers, ADK agents, biometric-scout
  - Both Python and Rust MCP server variants

  No active: RDS, EKS clusters, ECS, Elastic Beanstalk environments

  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  The environment is clearly an AI agent development platform — experimenting with Bedrock AgentCore, ADK (Agent Development
  Kit), and MCP (Model Context Protocol) server patterns, with both Python and Rust implementations.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  So Why? Just Why? Why Mix and Match Cloud Setups?
&lt;/h4&gt;

&lt;p&gt;Kiro was used to summarize the advantages of the approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  summarize the benefits of using Kiro from the Google Cloud Shell

  Based on your specific environment — an AI agent developer working across GCP and AWS from Cloud Shell — here's where Kiro
  adds real value:

  Cross-Cloud Visibility in One Place

  You just saw this: Kiro can summarize your GCP project and AWS environment in a single conversation, correlating context
  across both clouds without switching tools or consoles.

  Codebase Intelligence

  Cloud Shell gives you a terminal; Kiro gives you a coding partner. For your agent repos (ADK, MCP servers, Bedrock
  AgentCore), Kiro can read, understand, and modify code across files — not just run commands.

  Infrastructure as Code Assistance

  With 13 CloudFormation stacks and CDK in use, Kiro can help write, debug, and refactor IaC — understanding the intent
  behind stacks like eksctl-adkui-eks-cluster rather than just listing them.

  Agentic Workflows Without Context Switching

  You're building AI agents. Kiro can help you design, implement, and deploy them — writing the Cloud Run service configs,
  Dockerfiles, Lambda handlers, and Bedrock agent definitions — all from the same shell session where you're already
  working.

  Persistent Task Execution

  Unlike a plain terminal, Kiro tracks multi-step tasks, catches errors, and adapts — useful when deploying across regions
  (your us-central1 / us-east4 Cloud Run pattern) or debugging failed services like gpu-31b-l4-devops-agent.

  No Extra Tooling

  Cloud Shell already has gcloud, aws, docker, git, and python — everything Kiro needs. There's nothing to install; you get
  AI-assisted development on top of the environment you already have.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Google Cloud Shell terminal and editor provide an environment for rapidly prototyping agentic development. With some configuration and customization, the Kiro CLI provides core cross-cloud functionality from that environment.&lt;/p&gt;




</description>
      <category>kiro</category>
      <category>aws</category>
      <category>googlecloudplatform</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Untrusted Code, Trusted Cluster: Scaling Secure AI Agent Workspaces with GKE Agent Sandbox</title>
      <dc:creator>Saurabh Mishra</dc:creator>
      <pubDate>Sun, 31 May 2026 04:03:42 +0000</pubDate>
      <link>https://dev.to/gde/untrusted-code-trusted-cluster-scaling-secure-ai-agent-workspaces-with-gke-agent-sandbox-1mk1</link>
      <guid>https://dev.to/gde/untrusted-code-trusted-cluster-scaling-secure-ai-agent-workspaces-with-gke-agent-sandbox-1mk1</guid>
      <description>&lt;p&gt;How a gVisor-powered sandbox isolates AI-generated code at the kernel level, and why it changes everything for multi-tenant agentic systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffa10fmqc6d97z39yixp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffa10fmqc6d97z39yixp.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we are going to discuss the following points:&lt;/p&gt;

&lt;p&gt;The problem with AI agents writing code&lt;br&gt;
What is GKE Agent Sandbox?&lt;br&gt;
How gVisor intercepts the kernel&lt;br&gt;
Architecture deep dive&lt;br&gt;
Setting it up: step by step&lt;br&gt;
Production patterns&lt;br&gt;
Conclusion&lt;/p&gt;

&lt;p&gt;There's a moment every engineer running AI agents eventually faces: an LLM generates a perfectly plausible subprocess.run() call, pipes it to bash -c, and you realise that you are one prompt injection away from a full container escape. The code looks reasonable. The agent trusts itself. And your cluster's blast radius just became everyone's problem.&lt;/p&gt;

&lt;p&gt;This is the defining security problem of the agentic era. Language models don't just generate text anymore; they write, execute, and iterate on code in tight feedback loops. The capabilities that make them useful (unrestricted Python, shell access, file I/O) are exactly the capabilities that make them dangerous in a shared cluster.&lt;/p&gt;

&lt;p&gt;Google's answer — &lt;strong&gt;GKE Agent Sandbox&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GKE Agent Sandbox&lt;/strong&gt; is built for agentic workloads that require high-level scale, extensibility, and security. Key benefits include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kernel-level isolation&lt;/strong&gt;: Provides strong, kernel-level isolation for untrusted, LLM-generated code by using built-in GKE features like GKE Sandbox. Agent Sandbox also supports the open source Kata Containers software.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sub-second provisioning&lt;/strong&gt;: Offers an out-of-the-box mechanism to provide sandboxes significantly faster than standard Kubernetes Pod scheduling allows (typically &amp;lt;1s).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-native extensibility&lt;/strong&gt;: Leverages the power of the Kubernetes paradigm and the managed infrastructure of GKE.&lt;/p&gt;

&lt;p&gt;By providing a declarative, standardized API, GKE Agent Sandbox offers a single-container experience with isolation and persistence characteristics similar to a virtual machine (VM), built entirely on Kubernetes primitives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem with AI agents writing code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agentic AI systems, whether you're building with LangGraph, AutoGen, Claude's tool-use API, or rolling your own, share a common architectural pattern: the model generates code, a runtime executes it, results flow back to the model, and the loop continues. At each iteration, the model has broader context about what worked and what didn't. This is enormously powerful for automating complex tasks.&lt;/p&gt;

&lt;p&gt;It also creates an attack surface that traditional Kubernetes security was never designed to handle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bnsnippzxnj3bpgls0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bnsnippzxnj3bpgls0m.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container escape&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLM-generated code exploits known kernel vulnerabilities or misconfigured capabilities to break out of the container boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt injection via code output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Malicious content in retrieved data embeds instructions that manipulate the agent into executing attacker-controlled payloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lateral network movement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent with network access can enumerate internal services, extract credentials, and pivot across your cluster — all through legitimate-looking Python requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filesystem exfiltration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without mount restrictions, agents can read service account tokens, Kubernetes secrets mounted as volumes, and host path data.&lt;/p&gt;

&lt;p&gt;Standard container security (&lt;strong&gt;securityContext&lt;/strong&gt;, network policies, Pod Security Admission) provides defence in depth but doesn't address the fundamental issue: containers share the host kernel. If the kernel has a vulnerability, a sufficiently motivated attacker (or sufficiently capable LLM) can exploit it regardless of namespace isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is GKE Agent Sandbox?&lt;/strong&gt;&lt;br&gt;
GKE Agent Sandbox is a Google-managed node pool configuration that applies gVisor-based container sandboxing specifically tuned for agentic AI workloads. &lt;/p&gt;

&lt;p&gt;At its core, it combines three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gVisor runtime (runsc) as the default OCI runtime&lt;/strong&gt;&lt;br&gt;
Every container in the sandbox node pool runs under runsc instead of the standard runc. This intercepts all syscalls through a user-space kernel implementation called Sentry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent-specific resource isolation profiles&lt;/strong&gt;&lt;br&gt;
Pre-configured seccomp and AppArmor profiles optimised for Python/Node.js/container-in-container workloads that AI agents commonly generate. No manual tuning of syscall allowlists required for standard use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrated observability via Cloud Monitoring&lt;/strong&gt;&lt;br&gt;
Syscall audit logs, sandbox violation events, and resource consumption metrics flow automatically into Cloud Monitoring — giving you behavioural baselines for agent workloads without custom instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How gVisor intercepts the kernel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understanding what gVisor actually does is essential for reasoning about its security guarantees. The mental model most engineers have of containers — "a process with namespaces and cgroups" — breaks down when thinking about gVisor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl88o6ihjts4yx3x32efa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl88o6ihjts4yx3x32efa.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a standard container, your application's open(), read(), execve(), and socket() calls go directly to the host Linux kernel via the system call interface. The kernel has to handle them, which means a kernel vulnerability is reachable from inside the container.&lt;/p&gt;

&lt;p&gt;With gVisor, those same syscalls are intercepted by Sentry, a Go implementation of the Linux kernel that runs entirely in user space. Sentry implements the Linux ABI from scratch. When your agent code calls execve(), it's Sentry that handles it, not the host kernel. Sentry then makes a much smaller set of calls to the actual host kernel (through a restricted interface called the "platform") to handle things like memory mapping and scheduling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;End-to-End Architectural Blueprint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To isolate untrusted code execution while maintaining a highly responsive management plane, the architecture splits the cluster into two distinct, specialized node pools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standard Node Pool (The Brain)&lt;/strong&gt; - This pool runs your trusted, long-lived orchestration services. Because this code is written and audited by your team, it runs on the standard Linux host kernel for maximum performance and native access to internal cluster resources.&lt;br&gt;
Agent Controller: The core engine managing the life cycle of AI agent tasks, spin-up times, and state tracking.&lt;br&gt;
Tool Router: Mediates external API calls and manages what capabilities (e.g., web search, database querying) are exposed to the agent.&lt;br&gt;
Result Collector: Aggregates outputs, logs, and state changes from the runtime pods.&lt;br&gt;
State &amp;amp; Storage (Postgres/Redis): Highly available data layers tracking session memory and agent state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Sandbox Node Pool (The Muscle)&lt;/strong&gt; - This pool is dedicated entirely to executing untrusted code generated by AI models. It uses the runtimeClassName: gvisor configuration to enforce strict kernel-level isolation.&lt;br&gt;
Code Executor Pods (N Pods): Ephemeral, rapid-churn pods designed to spin up, run a specific snippet of generated code, and terminate.&lt;br&gt;
The Sentry (User-Space Kernel): gVisor’s core component. Instead of letting a Python agent talk directly to the host Linux kernel via standard system calls (syscall()), the Sentry intercepts them. It implements a core suite of Linux kernel primitives in user space, shielding the host bare-metal or VM infrastructure from container escape vulnerabilities.&lt;/p&gt;
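
&lt;p&gt;As a rough illustration of the runtimeClassName: gvisor setting, the sketch below builds a Pod manifest for one ephemeral executor as a plain Python dictionary. The function name, namespace, label, image, and resource limits are all assumptions made for the example, not values from GKE Agent Sandbox itself.&lt;/p&gt;

```python
import json


def executor_pod(name, image, namespace="agents"):
    """Build a Pod manifest for one ephemeral, gVisor-sandboxed code executor."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,  # hypothetical namespace
            "labels": {"role": "code-executor"},  # hypothetical label
        },
        "spec": {
            # Selects the gVisor runtime (runsc) instead of the default runc.
            "runtimeClassName": "gvisor",
            # Run the snippet once and exit: executors are ephemeral.
            "restartPolicy": "Never",
            # Keep the service account token away from untrusted code.
            "automountServiceAccountToken": False,
            "containers": [
                {
                    "name": "executor",
                    "image": image,
                    # Illustrative limits only; tune for your workload.
                    "resources": {"limits": {"cpu": "1", "memory": "512Mi"}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # The image name is a placeholder; kubectl accepts JSON as well as YAML.
    pod = executor_pod("executor-demo", "example.com/agent-executor:latest")
    print(json.dumps(pod, indent=2))
```

&lt;p&gt;Printing the manifest as JSON keeps the sketch free of third-party dependencies; the output could be piped to kubectl apply -f - on a cluster where the gvisor RuntimeClass exists.&lt;/p&gt;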

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkmtjchyjhzkf893fs5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkmtjchyjhzkf893fs5k.png" alt=" " width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workload Identity &amp;amp; RBAC Separation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By separating Kubernetes Service Accounts (KSAs) and mapping them to distinct Google Cloud IAM Service Accounts, we eliminate the risk of privilege escalation if an agent is compromised.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability and Behavioral Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because sandbox runtimes are naturally adversarial, observability shifts from standard application performance monitoring (APM) to real-time behavioral and security auditing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zegoq0vtbbkkubdxsxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zegoq0vtbbkkubdxsxw.png" alt=" " width="799" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Syscall Audit Logs&lt;/strong&gt;: gVisor provides structured logs of intercepted system calls via its internal logging mechanisms. Unusual system calls (e.g., attempts to use forbidden network protocols or manipulate raw sockets directly) are immediately streamed to Cloud Logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Violation Events&lt;/strong&gt;: Any attempt by a sandboxed container to bypass the Sentry or execute an invalid operation triggers an immediate containment event, surfaced directly in Google Cloud Security Command Center.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Monitoring&lt;/strong&gt;: Aggregates container-level metrics (CPU, memory, churn rate). Crucial for detecting malicious infinite loops or resource-exhaustion (DoS) attempts disguised as AI agent tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Trace&lt;/strong&gt;: End-to-end distributed tracing maps exactly how long a request spends routing through the Tool Router versus how long it spends executing inside the gVisor sandbox, allowing you to fine-tune the performance overhead introduced by user-space context switching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting it up: step by step&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's a complete walkthrough from a fresh GKE cluster to a running sandboxed agent workload. This assumes you have gcloud, kubectl, and Terraform configured for your project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farigxfvp4idh54alhp0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farigxfvp4idh54alhp0t.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production patterns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: Warm pool with pre-forked executors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cold-starting a new pod for every code execution adds latency. The standard pattern is to maintain a pool of warm executor pods that listen for work over a task queue (Pub/Sub or Redis Streams). The controller dispatches code snippets to idle executors; completed executors reset their environment and return to the pool. A garbage collection sidecar restarts pods that have been warm too long to prevent state accumulation.&lt;/p&gt;
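
&lt;p&gt;A minimal in-process sketch of the warm-pool lifecycle described above. It assumes a hypothetical Executor class standing in for a real pod, and it uses a run count as a simple proxy for "warm too long". The names WarmPool, Executor, and max_runs are illustrative and are not part of GKE Agent Sandbox; a real controller would pull work from Pub/Sub or Redis Streams and recycle pods through the Kubernetes API.&lt;/p&gt;

```python
from collections import deque


class Executor:
    """Stand-in for one warm executor pod (hypothetical, not a real API)."""

    def __init__(self, name):
        self.name = name
        self.runs = 0
        self.scratch = {}  # state that must not leak between snippets

    def run(self, snippet):
        self.runs += 1
        self.scratch["last_snippet"] = snippet
        return f"{self.name} ran {snippet!r}"

    def reset(self):
        self.scratch.clear()


class WarmPool:
    """Keeps idle executors ready and recycles them after max_runs snippets."""

    def __init__(self, size, max_runs):
        self.max_runs = max_runs
        self.created = 0
        self.idle = deque(self._new_executor() for _ in range(size))

    def _new_executor(self):
        self.created += 1
        return Executor(f"executor-{self.created}")

    def dispatch(self, snippet):
        executor = self.idle.popleft()  # IndexError here means the pool is exhausted
        try:
            return executor.run(snippet)
        finally:
            self._release(executor)

    def _release(self, executor):
        # Reset per-run state, then replace executors that reached their run limit.
        executor.reset()
        if executor.runs == self.max_runs:
            executor = self._new_executor()
        self.idle.append(executor)


pool = WarmPool(size=2, max_runs=2)
for snippet in ["print(1)", "print(2)", "print(3)"]:
    print(pool.dispatch(snippet))
```

&lt;p&gt;Resetting on release rather than on dispatch means an idle executor is always clean, so the dispatch path stays as short as possible.&lt;/p&gt;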

&lt;p&gt;&lt;strong&gt;Pattern 2: Execution budget enforcement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI agents can get into infinite loops. Beyond Kubernetes resource limits, apply an application-level timeout using Python's signal.alarm or Go's context cancellation. A 30-second wall-clock timeout with a 10-second CPU-time budget covers almost all legitimate agent code execution patterns while preventing runaway loops from consuming pool capacity.&lt;/p&gt;
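
&lt;p&gt;One possible shape for such a budget in Python, sketched under a few assumptions: it runs on a Unix system, in the main thread, and it uses signal.setitimer (a sibling of signal.alarm in the same standard-library module) so that the wall-clock and CPU-time budgets can be set separately. The names execution_budget and BudgetExceeded are made up for this example.&lt;/p&gt;

```python
import signal
from contextlib import contextmanager


class BudgetExceeded(Exception):
    """Raised inside the running code when one of the budgets expires."""


@contextmanager
def execution_budget(wall_seconds=30.0, cpu_seconds=10.0):
    """Limit the enclosed block by wall-clock time and by user CPU time."""

    def on_expire(signum, frame):
        raise BudgetExceeded(signal.Signals(signum).name)

    # SIGALRM fires on wall-clock time, SIGVTALRM on CPU time used by the process.
    old_alrm = signal.signal(signal.SIGALRM, on_expire)
    old_vtalrm = signal.signal(signal.SIGVTALRM, on_expire)
    signal.setitimer(signal.ITIMER_REAL, wall_seconds)
    signal.setitimer(signal.ITIMER_VIRTUAL, cpu_seconds)
    try:
        yield
    finally:
        # Always disarm the timers and restore the previous handlers.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.setitimer(signal.ITIMER_VIRTUAL, 0)
        signal.signal(signal.SIGALRM, old_alrm)
        signal.signal(signal.SIGVTALRM, old_vtalrm)


try:
    with execution_budget(wall_seconds=0.2, cpu_seconds=5.0):
        while True:  # simulated runaway agent code
            pass
except BudgetExceeded as exc:
    print(f"stopped by {exc}")
```

&lt;p&gt;A signal handler only interrupts Python bytecode, so code blocked inside a long-running C extension call may overrun the budget. Kubernetes resource limits remain the backstop for that case.&lt;/p&gt;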

&lt;p&gt;&lt;strong&gt;Pattern 3: Network egress allow-listing per agent type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Different agent personas have different legitimate network needs. A data analysis agent needs access to BigQuery and GCS. A web research agent needs HTTP egress to the public internet. A code review agent needs neither. Model this with a separate NetworkPolicy per agent label, and use Pod labels to bind agents to the right policy at scheduling time.&lt;/p&gt;
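
&lt;p&gt;As a rough sketch, the per-agent policies could be generated from a small table. The label key agent-type, the policy names, and the CIDR ranges below are assumptions for illustration (the data-analysis range assumes Google APIs are reached through Private Google Access), not values from GKE Agent Sandbox. The manifests use the standard networking.k8s.io/v1 NetworkPolicy schema.&lt;/p&gt;

```python
import json

# Hypothetical label key and destination ranges; replace with your own.
LABEL_KEY = "agent-type"
EGRESS_CIDRS = {
    "data-analysis": ["199.36.153.8/30"],  # assumed Private Google Access range
    "web-research": ["0.0.0.0/0"],         # public internet
    "code-review": [],                     # no egress at all
}


def egress_policy(agent_type, cidrs):
    """Build a NetworkPolicy manifest that limits egress for one agent type."""
    rules = []
    if cidrs:
        rules.append({
            "to": [{"ipBlock": {"cidr": cidr}} for cidr in cidrs],
            "ports": [{"protocol": "TCP", "port": 443}],
        })
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{agent_type}-egress"},
        "spec": {
            # Pods carrying this label are bound to the policy.
            "podSelector": {"matchLabels": {LABEL_KEY: agent_type}},
            "policyTypes": ["Egress"],
            "egress": rules,  # an empty list denies all egress
        },
    }


policies = [egress_policy(name, cidrs) for name, cidrs in EGRESS_CIDRS.items()]
print(json.dumps(policies[2], indent=2))
```

&lt;p&gt;Agents that resolve hostnames also need an egress rule for cluster DNS, and the web-research range would normally exclude internal address space. Both are left out here to keep the sketch short.&lt;/p&gt;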

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agentic era is here, and it runs on code execution. Whether you're building autonomous research assistants, DevOps automation agents, or data pipeline orchestrators, you're eventually going to need a principled answer to the question: what happens when the model writes something it shouldn't?&lt;/p&gt;

&lt;p&gt;GKE Agent Sandbox doesn't make the threat go away. Prompt injection is still a model-level problem. Lateral movement still requires complementary network controls. Secrets management still requires RBAC discipline. But the sandbox answers a specific, hard question — what if agent-generated code exploits a kernel vulnerability or escalates privileges? — with a credible, production-tested answer: it runs against Sentry, not your host kernel.&lt;/p&gt;

&lt;p&gt;For most teams running agentic workloads on GKE, the operational cost is low (a single node pool configuration), the performance cost is acceptable (single-digit percentages for typical agent workload patterns), and the security benefit is significant (kernel-level isolation with full Kubernetes observability).&lt;/p&gt;

&lt;p&gt;That's the architectural question GKE Agent Sandbox is designed to answer. Build agentic systems with the assumption that the code will sometimes be wrong, sometimes be manipulated, and occasionally be malicious, and design your execution environment accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References and Documentation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox" rel="noopener noreferrer"&gt;https://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/machine-learning/agent-sandbox" rel="noopener noreferrer"&gt;https://docs.cloud.google.com/kubernetes-engine/docs/concepts/machine-learning/agent-sandbox&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>kubernetes</category>
      <category>cloud</category>
    </item>
    <item>
      <title>26B Gemma 4 Deployment with NVIDIA Blackwell 6000, MCP, Cloud Run, and Antigravity CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Sat, 30 May 2026 17:27:07 +0000</pubDate>
      <link>https://dev.to/gde/26b-gemma-4-deployment-with-nvidia-blackwell-6000-mcp-cloud-run-and-antigravity-cli-ipj</link>
      <guid>https://dev.to/gde/26b-gemma-4-deployment-with-nvidia-blackwell-6000-mcp-cloud-run-and-antigravity-cli-ipj</guid>
      <description>&lt;p&gt;This article provides a step by step deployment guide for Gemma 4 to a Google Cloud Run hosted GPU enabled system,. A suite of Python MCP tools is built to simplify management of the vLLM hosted Gemma 4 deployment with Antigravity CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is this project trying to do?
&lt;/h4&gt;

&lt;p&gt;This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.&lt;/p&gt;

&lt;p&gt;This project is similar to a previous project that targeted GPU-hosted Gemma 4 instances on GCP:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/gemma-sre-self-hosted-vllm-infrastructure-agent-afda000c0943" rel="noopener noreferrer"&gt;Gemma-SRE: Self-Hosted vLLM Infrastructure Agent&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;Antigravity CLI is the successor to Gemini CLI, the terminal-driven, agent-assisted coding tool.&lt;/p&gt;

&lt;p&gt;Full details on installing Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing the Antigravity CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools in place, you can test the startup of Antigravity CLI.&lt;/p&gt;

&lt;p&gt;You will need to authenticate with a Google Cloud Project or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Full Installation Instructions
&lt;/h4&gt;

&lt;p&gt;The detailed installation instructions for Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python MCP Documentation
&lt;/h4&gt;

&lt;p&gt;The official GitHub Repo provides samples and documentation for getting started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting MCP development for model management is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Antigravity CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.&lt;/p&gt;

&lt;p&gt;This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemma4-tips
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-26B-6000-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-26B-6000-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the set_env.sh script can be used to reset the environment if your session times out.&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Management Tool with MCP Stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features that the standard MCP libraries provide is abstracting the various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same no matter which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;The simplest transport that the SDK supports is the stdio (stdin/stdout) transport, which connects to a locally running process. Both the MCP client and the MCP server must be running in the same environment.&lt;/p&gt;

&lt;p&gt;The connection over stdio will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize FastMCP server
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Self-Hosted vLLM DevOps Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Running the Python Code
&lt;/h4&gt;

&lt;p&gt;First, switch to the directory containing the Python version of the MCP sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/gemma4-tips/gpu-6000-devops-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the release version on the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;install
&lt;/span&gt;Processing ./.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project can also be linted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-26B-6000-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make lint
ruff check &lt;span class="nb"&gt;.&lt;/span&gt;
All checks passed!
ruff format &lt;span class="nt"&gt;--check&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
6 files already formatted
mypy &lt;span class="nb"&gt;.&lt;/span&gt;
Success: no issues found &lt;span class="k"&gt;in &lt;/span&gt;6 &lt;span class="nb"&gt;source &lt;/span&gt;files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a test run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-26B-6000-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make &lt;span class="nb"&gt;test
&lt;/span&gt;python test_agent.py
2026-05-29 18:09:59,458 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 18:09:59,501 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-29 18:09:59,501 - vllm-devops-agent - INFO - Model response: &lt;span class="s1"&gt;'Response from Gemma...'&lt;/span&gt;
.2026-05-29 18:09:59,502 - vllm-devops-agent - INFO - Querying model with stats with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-29 18:09:59,503 - vllm-devops-agent - INFO - Model response with stats: &lt;span class="nv"&gt;TTFT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s, &lt;span class="nv"&gt;TotalTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s
.......
&lt;span class="nt"&gt;----------------------------------------------------------------------&lt;/span&gt;
Ran 17 tests &lt;span class="k"&gt;in &lt;/span&gt;0.024s

OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  MCP stdio Transport
&lt;/h4&gt;

&lt;p&gt;In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.&lt;/p&gt;
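
&lt;p&gt;To make the stdio transport more concrete, here is a small standard-library sketch of the message framing it carries: JSON-RPC 2.0 messages written one per line to the server's stdin and read back from its stdout. The helper names encode_message and decode_messages are invented for this illustration; in the project itself the MCP SDK handles this framing.&lt;/p&gt;

```python
import json


def encode_message(message):
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the line never contains a raw one.
    line = json.dumps(message, separators=(",", ":"))
    return (line + "\n").encode("utf-8")


def decode_messages(stream_bytes):
    """Split a byte stream into JSON-RPC messages, one per non-empty line."""
    lines = stream_bytes.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


# A tools/list request, as an MCP client would send it to a server.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}
wire = encode_message(request)
print(wire)
print(decode_messages(wire))
```

&lt;p&gt;Because stdout carries the protocol messages, a server using this transport should send its own logging to stderr.&lt;/p&gt;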

&lt;h4&gt;
  
  
  Antigravity CLI mcp_config.json
&lt;/h4&gt;

&lt;p&gt;A sample MCP server configuration file is provided in the .agents directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gpu-26b-6000-devops-agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/home/xbill/gemma4-tips/gpu-26B-6000-devops-agent/server.py"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aisprint-491218"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"VLLM_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://gpu-26b-6000-devops-agent-wgcq55zbfq-uc.a.run.app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MODEL_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/mnt/models/gemma-4-26B-A4B-it"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
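
&lt;p&gt;The sample above hard-codes one user's home directory and project. As a hedged sketch, the same structure could be generated for the local machine. The function name build_mcp_config and the placeholder project and URL values are assumptions; the keys mirror the sample file.&lt;/p&gt;

```python
import json
from pathlib import Path


def build_mcp_config(server_script, project, location, vllm_base_url, model_name):
    """Return an mcp_config.json structure with the same keys as the sample."""
    return {
        "mcpServers": {
            "gpu-26b-6000-devops-agent": {
                "command": "python3",
                "args": [str(server_script)],
                "env": {
                    "GOOGLE_CLOUD_PROJECT": project,
                    "GOOGLE_CLOUD_LOCATION": location,
                    "VLLM_BASE_URL": vllm_base_url,
                    "MODEL_NAME": model_name,
                },
            }
        }
    }


# Resolve server.py relative to the current user's home directory.
server_script = Path.home() / "gemma4-tips" / "gpu-26B-6000-devops-agent" / "server.py"
config = build_mcp_config(
    server_script,
    project="my-project-id",  # placeholder
    location="us-central1",
    vllm_base_url="https://example-service.a.run.app",  # placeholder
    model_name="/mnt/models/gemma-4-26B-A4B-it",
)
print(json.dumps(config, indent=2))
```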



&lt;h4&gt;
  
  
  Validation with Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;The final connection test uses Antigravity CLI as an MCP client with the Python code providing the MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
&amp;gt; ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-26b-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                                analyze_cloud_logging, +19 more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Getting Started with Gemma 4 on GPU
&lt;/h4&gt;

&lt;p&gt;The official vLLM repo also has Gemma 4-specific information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm/releases" rel="noopener noreferrer"&gt;Releases · vllm-project/vllm&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Lifecycle Management via MCP
&lt;/h4&gt;

&lt;p&gt;The MCP tools provide a complete suite of agent-oriented operations for managing a vLLM deployment on Cloud Run with GPU.&lt;/p&gt;

&lt;p&gt;Overview of the MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_help

● gpu-26b-6000-devops-agent/get_help(Get help information) (ctrl+o to expand)

  Here is the help and active configuration information for the Gemma 4 DevOps/SRE Agent:

  ### ⚙️ Active Environment Configuration

  • GCP Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • GCP Region ( GOOGLE_CLOUD_LOCATION ): us-central1 
  • GCS Model Bucket ( BUCKET_NAME ): aisprint-491218-bucket 
  • Model Repo/Path ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • Cloud Run URL ( VLLM_BASE_URL ): https://gpu-26b-6000-devops-agent-wgcq55zbfq-uc.a.run.app
  • Mode: Cloud Run mode (targeting NVIDIA RTX PRO 6000 GPU in us-central1 ).
  ──────
  ### 🧰 Available MCP Tools

  #### 🐳 Infrastructure &amp;amp; Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.

  #### 📦 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy models from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download from Hugging Face and upload to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .

  #### 📊 Monitoring &amp;amp; Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  #### 📈 Performance &amp;amp; Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  #### 💬 Interaction &amp;amp; Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.

&amp;gt; list mcp tools

  Here is the list of available MCP servers and their corresponding tools:

  ### 1. gpu-26b-6000-devops-agent (GPU 26B DevOps Agent)

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.
  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy models from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download from Hugging Face and upload to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .
  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.
  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.
  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
  • get_help : Gets help information, environment configurations, and active mode details.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploy The Model
&lt;/h4&gt;

&lt;p&gt;The MCP server tools are used to deploy the model to Cloud Run with GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deploy the model

  ### Deployment Status Summary

  The Gemma 4 26B model has been successfully deployed to Google Cloud Run in us-central1 using the NVIDIA RTX 6000 (
  nvidia-rtx-pro-6000 ) GPU accelerator!

  Here is the current state of the deployments:

  1. Active &amp;amp; Serving Revision ( vllm-gemma-4-26b-a4b-it-00003-zfm ) — 100% Traffic:
      • Status: Healthy &amp;amp; Active (Successfully booted and serving inference).
      • GPU Accelerator: nvidia-rtx-pro-6000 (NVIDIA L4 RTX 6000)
      • Resources: 20 vCPUs, 80 GiB memory, 1 GPU.
      • vLLM Configuration: Loaded weights directly from Hugging Face using the HF_TOKEN secret. The startup,
      initialization, and Triton attention backend compilations completed successfully.
      • Service Endpoint: https://vllm-gemma-4-26b-a4b-it-289270257791.us-central1.run.app 
  2. GCS FUSE Revision ( vllm-gemma-4-26b-a4b-it-00004-2nk ) — Provisioning:
      • Status: Loading Safetensors Shards (Mounting from aisprint-491218-bucket/gemma-4-26B-A4B-it/ ).
      • We successfully cached the model weights to the GCS bucket ( gs://aisprint-491218-bucket/gemma-4-26B-A4B-it/ )
      and deployed using GCS FUSE volume mounting.
      • Note: Loading via GCS FUSE is experiencing slower shard loading speeds due to GCS FUSE random access ( mmap )
      latency overhead. Since Revision 00003-zfm is fully ready and serving 100% of the traffic, you can query the
      endpoint immediately.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checking System Status
&lt;/h4&gt;

&lt;p&gt;The status can be checked with an MCP tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; check_status

● gpu-26b-6000-devops-agent/get_system_status(System status check) (ctrl+o to expand)

  The system is running and fully operational:

  • vLLM Health: 🟢 Online ( https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app )
  • Cloud Run Service Status: 🟢 Ready

  You can interact with the model using query_gemma4 .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cross Check The Deployed Model
&lt;/h4&gt;

&lt;p&gt;Once the model is running, the MCP tools allow for model verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 &amp;gt; verify_model_health

● gpu-26b-6000-devops-agent/verify_model_health(Model health verification) (ctrl+o to expand)

  The model health check has passed successfully:

  • Status: ✅ PASSED
  • Target Model: gpu-26b-6000-devops-agent 
  • Latency: 0.17 seconds
  • Model Response: "Yes, I am working! How can I help you today?..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_model_details

● gpu-26b-6000-devops-agent/get_model_details(Model details retrieval) (ctrl+o to expand)

  Here are the details for the currently active model:

  ### 🧩 Model Details

  • Endpoint: https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app 
  • Health Status ( /health ): Healthy ✅

  #### Model Information ( /v1/models )

    [
      {
        "id": "gpu-26b-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
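
&lt;p&gt;The same two checks can be approximated outside of MCP. The sketch below assumes the vLLM OpenAI-compatible server paths /health and /v1/models shown in the output above, and that /v1/models wraps its entries in a "data" list (a bare list, as printed by the tool, is also accepted). The helper names are illustrative, and a private Cloud Run service would additionally need an identity token.&lt;/p&gt;

```python
import json
import urllib.request


def endpoint_urls(base_url):
    """Build the health and model-list URLs for a vLLM endpoint."""
    base = base_url.rstrip("/")
    return {"health": f"{base}/health", "models": f"{base}/v1/models"}


def parse_model_ids(payload):
    """Return model ids from an OpenAI-style model list (wrapped or bare)."""
    entries = payload["data"] if isinstance(payload, dict) else payload
    return [entry["id"] for entry in entries]


def fetch_model_ids(base_url, timeout=10):
    """Call /v1/models on a live endpoint (needs network access and any required auth)."""
    url = endpoint_urls(base_url)["models"]
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_model_ids(json.load(response))


# Offline example using the entry shown in the tool output.
sample = {
    "object": "list",
    "data": [{"id": "gpu-26b-6000-devops-agent", "object": "model", "owned_by": "vllm"}],
}
print(parse_model_ids(sample))
```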



&lt;p&gt;and the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_endpoint

● gpu-26b-6000-devops-agent/get_endpoint(Endpoint retrieval) (ctrl+o to expand)

  The Cloud Run vLLM endpoint is:

  🟢 https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; get_system_details

● gpu-26b-6000-devops-agent/get_help(Agent help query) (ctrl+o to expand)

  Here are the active configuration and system details:

  ### ⚙️ System Configuration

  • Active Mode: Cloud Run (NVIDIA RTX PRO 6000 GPU) in region us-central1 
  • GCP Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • GCP Region ( GOOGLE_CLOUD_LOCATION ): us-central1 
  • GCS Bucket Name ( BUCKET_NAME ): aisprint-491218-bucket 
  • Model Path ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • vLLM Endpoint URL ( VLLM_BASE_URL ): https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app 

  ### 📊 Deployment Status

  • vLLM Health: 🟢 Online
  • Cloud Run Service Status: 🟢 Ready

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the Cloud Run deployment status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; show the cloud run docker status

● ListDir&lt;span class="o"&gt;(&lt;/span&gt;/home/xbill/gemma4-tips/gpu-26B-6000-devops-agent&lt;span class="o"&gt;)&lt;/span&gt;
● Bash&lt;span class="o"&gt;(&lt;/span&gt;gcloud run services describe gpu-26b-6000-devops-agent &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"yaml"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;ctrl+o to &lt;span class="nb"&gt;expand&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  Here is the detailed status and configuration of the Docker container running on Cloud Run:

  &lt;span class="c"&gt;### 🐳 Docker &amp;amp; Container Configuration&lt;/span&gt;

  • Docker Image: vllm/vllm-openai:latest 
  • Target Port: 8080 &lt;span class="o"&gt;(&lt;/span&gt;HTTP&lt;span class="o"&gt;)&lt;/span&gt;
  • Service Account: 289270257791-compute@developer.gserviceaccount.com 

  &lt;span class="c"&gt;#### ⚙️ Resources &amp;amp; Limits&lt;/span&gt;

  • CPU Limit: 20 CPUs
  • Memory Limit: 80Gi 
  • GPU Limit: 1 &lt;span class="o"&gt;(&lt;/span&gt;NVIDIA RTX PRO 6000 via nvidia-rtx-pro-6000 node selector&lt;span class="o"&gt;)&lt;/span&gt;
  • Container Concurrency: 16 
  • Timeout: 3600 seconds &lt;span class="o"&gt;(&lt;/span&gt;1 hour&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;#### 📦 Volume Mounts&lt;/span&gt;

  • Driver: gcsfuse.run.googleapis.com &lt;span class="o"&gt;(&lt;/span&gt;GCS FUSE&lt;span class="o"&gt;)&lt;/span&gt;
  • Bucket: aisprint-491218-bucket 
  • Mount Path: /mnt/models &lt;span class="o"&gt;(&lt;/span&gt;mounted read-only&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="c"&gt;#### 🚀 Execution Command&lt;/span&gt;

    bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"vllm serve /mnt/models/gemma-4-26B-A4B-it &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --served-model-name gpu-26b-6000-devops-agent &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-log-requests &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-chunked-prefill &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-prefix-caching &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --generation-config auto &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-auto-tool-choice &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --tool-call-parser gemma4 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --reasoning-parser gemma4 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --dtype bfloat16 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --quantization fp8 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --kv-cache-dtype fp8 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --max-num-seqs 8 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --gpu-memory-utilization 0.95 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --tensor-parallel-size 1 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --load-format runai_streamer &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --port 8080 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --max-model-len 32767"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check the Cloud Run Service
&lt;/h4&gt;

&lt;p&gt;Once deployed, the model is visible as a standard Cloud Run service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdhqxpts8gf5dlynwqvv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdhqxpts8gf5dlynwqvv.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Review the Model
&lt;/h4&gt;

&lt;p&gt;Check the model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; what are the model details?

● gpu-26b-6000-devops-agent/get_model_details(Model details retrieval) (ctrl+o to expand)

  The model details are as follows:

  • Serving Endpoint: https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app 
  • Model ID (served name): gpu-26b-6000-devops-agent 
  • Owner: vllm 
  • Engine Health Status: Healthy ✅
  • Served Model Source Path: /mnt/models/gemma-4-26B-A4B-it (loaded from the bucket aisprint-491218-bucket via GCS
  FUSE).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once everything is online, you can send queries to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; query_gemma4_with_stats what is gemma?
● gpu-26b-6000-devops-agent/query_gemma4_with_stats(Query model with stats) (ctrl+o to expand)

  Here are the performance stats and the model's explanation of Gemma:

  ### 📊 Performance Stats

  • Model: gpu-26b-6000-devops-agent 
  • Time to First Token (TTFT): 0.123s 
  • Total Generation Time: 4.329s 
  • Throughput: 169.07 tokens/s 
  • Total Tokens (approx.): 711 
  ──────
  ### 💬 Model Response

  Gemma is a family of lightweight, state-of-the-art open models developed by Google DeepMind.

  If you are familiar with Google’s Gemini (their massive, closed-source AI), you can think of Gemma as the "smaller,
  open sibling." While Gemini is built to power massive consumer applications, Gemma is built for developers,
  researchers, and enthusiasts to run on their own hardware.

  Here is a breakdown of what makes Gemma significant:

  #### 1. "Open Weights" vs. "Open Source"

  It is technically more accurate to call Gemma an open weights model rather than fully "open source."

  • What this means: Google has released the "brain" (the weights/parameters) of the model. This allows you to download
  it, run it on your own laptop or private server, and fine-tune it for specific tasks (like medical coding or creative
  writing).
  • The Difference: You can use it freely, but you don't necessarily have access to the raw training data or the exact
  proprietary training code used by Google.

  #### 2. Built from Gemini Technology

  Gemma was built using the same research, technology, and infrastructure used to create the Gemini models. This means
  that even though Gemma is much smaller, it inherits much of the "reasoning" capability, mathematical ability, and
  linguistic nuance found in Google's most powerful AI.

  #### 3. Key Characteristics

  • Lightweight and Efficient: Unlike massive models that require a room full of servers, Gemma is designed to be small
  enough to run on a high-end consumer laptop or a single GPU.
  • Variety of Sizes: Gemma is released in different sizes (measured in "parameters," such as 2B, 7B, 9B, or 27B).
      • A 2B (2 billion) model is tiny and can run on a phone or a basic laptop.
      • A 27B (27 billion) model is much more intelligent but requires more powerful hardware.
  • Multimodal Capabilities: Newer versions (like PaliGemma) are designed to understand both text and images.

  #### 4. Why does Gemma exist? (The Purpose)

  Google released Gemma to support the global developer community. By providing Gemma, they are:

  • Empowering Developers: Allowing people to build specialized AI apps without paying massive API fees to Google or
  OpenAI.
  • Research: Enabling scientists to study how LLMs (Large Language Models) work, which is harder to do with "closed"
  models like GPT-4.
  • Customization: Allowing a company to take Gemma and "teach" it their specific private data without that data ever
  leaving their secure servers.

  #### Summary Table

   Feature | Gemini | Gemma
  -------------------------------|----------------------------------------|--------------------------------------------
   Access | Closed (via API/App) | Open (Downloadable weights)
   Size | Massive (Trillions of parameters) | Small to Medium (Billions)
   Hardware | Google's Supercomputers | Laptops, Desktops, Private Clouds
   Best For | General purpose, high-complexity tasks | Specific tasks, local privacy, development

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
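
&lt;p&gt;The performance stats in the transcript can be sanity-checked with a few lines of arithmetic. The sketch below assumes that throughput is counted over the decode window only (total time minus time to first token); how the tool actually computes it is not shown in the article, and &lt;code&gt;stream_stats&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
def stream_stats(ttft_s, total_s, total_tokens):
    """Derive decode throughput from streaming timings.

    Assumption: throughput is measured over the decode window only,
    that is, the time elapsed after the first token arrived.
    """
    decode_s = total_s - ttft_s
    if not decode_s or not total_tokens:
        return {"decode_s": decode_s, "tokens_per_s": 0.0}
    return {"decode_s": decode_s, "tokens_per_s": total_tokens / decode_s}


# Figures copied from the transcript above (the token count is approximate).
stats = stream_stats(ttft_s=0.123, total_s=4.329, total_tokens=711)
print(f"decode window: {stats['decode_s']:.3f}s")
print(f"throughput: {stats['tokens_per_s']:.2f} tokens/s")
```

&lt;p&gt;Under that assumption the result lands close to the reported 169.07 tokens/s; the small gap is plausible given the rounded timings and the approximate token count.&lt;/p&gt;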



&lt;h4&gt;
  
  
  Benchmark the Local Model
&lt;/h4&gt;

&lt;p&gt;The deployed model was then benchmarked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

I have completed the benchmark sweep and generated a full performance report.

The benchmark tested 1 to 256 concurrent users across context windows ranging from 8 to 16,384 tokens on the vLLM
  Cloud Run instance (NVIDIA RTX 6000 GPU).

### 📈 Benchmark Results Summary

• Continuous Batching Scaling: The RTX 6000 GPU handles high concurrency remarkably well. For a standard 512-token   
  context, aggregate throughput scaled from 138.52 tokens/s (1 user) to 1,136.26 tokens/s (256 concurrent users).
  • Large Context Performance (16,384 tokens): Even under maximum context sizes (16k tokens) and extreme concurrency
  (256 users), the model achieved an aggregate throughput of 849.39 tokens/s with a 100% success rate.
  • Low Latency: Average latency stays under 2.2 seconds for up to 16 concurrent users across all context lengths.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
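
&lt;p&gt;To put the scaling numbers in perspective, here is a small worked calculation using the two 512-token figures quoted above. It is only a sketch: &lt;code&gt;scaling_summary&lt;/code&gt; is a hypothetical helper, and the per-user figure is a simple average rather than a measured per-request rate.&lt;/p&gt;

```python
def scaling_summary(single_user_tps, aggregate_tps, users):
    """Summarize how aggregate throughput scales with concurrency."""
    speedup = aggregate_tps / single_user_tps
    return {
        "speedup": speedup,                      # aggregate gain over one user
        "per_user_tps": aggregate_tps / users,   # average share per user
        "efficiency": speedup / users,           # 1.0 would be linear scaling
    }


# 512-token context: 1 user vs 256 concurrent users, from the report above.
summary = scaling_summary(138.52, 1136.26, 256)
print(f"speedup: {summary['speedup']:.1f}x")
print(f"average per user: {summary['per_user_tps']:.2f} tokens/s")
print(f"scaling efficiency: {summary['efficiency']:.1%}")
```

&lt;p&gt;The aggregate gain is roughly 8x, which means each of the 256 users sees only a few tokens per second on average. High aggregate throughput and low per-user latency are separate goals.&lt;/p&gt;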



&lt;p&gt;The agent's hardware analysis can then be reviewed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  Here is a hardware-focused Deep Dive comparing the NVIDIA RTX 6000 (96GB/48GB VRAM) and NVIDIA L4 (24GB VRAM) for
  serving Gemma 4 models.
  ──────
  ### 📊 GPU Performance Comparison Chart

  🖼 Image: NVIDIA RTX 6000 vs NVIDIA L4 Performance Comparison →
  /home/xbill/.gemini/antigravity-cli/brain/2d9e479f-08d1-4cb8-826d-f51e949ae228/gpu_comparison_chart.png
  ──────
  ### 🔍 Architectural Comparison: RTX 6000 vs. NVIDIA L4

   Specification | NVIDIA RTX 6000 | NVIDIA L4
  --------------------------------------|-------------------------------------|----------------------------------------
   VRAM | 96 GB (or 48 GB) GDDR6 ECC | 24 GB GDDR6
   Memory Bandwidth | 960 GB/s | 300 GB/s
   FP32 Compute | 91.1 TFLOPS | 30 TFLOPS
   Tensor Cores | 568 (Gen 4) | 242 (Gen 4)
   Power Consumption | 300W | 72W
   Best Target Model | Gemma 4 26B / 31B (Large models) | Gemma 4 E4B / 9B (Medium/small models)
  ──────
  ### 💡 Key Takeaways from the Deep Dive

  #### 1. Memory Bandwidth Bottlenecks

  • Large Language Models (LLMs) are highly memory bandwidth-bound during the token-generation phase (decoding). Every
  single token generation cycle requires reloading the entire model's weights from high-speed memory into the
  processing cores.
  • The RTX 6000 (960 GB/s) has 3.2x higher memory bandwidth than the NVIDIA L4 (300 GB/s). This means it can fetch
  weights and process individual sequence updates 3 times faster, directly resulting in the extremely high tokens-per-
  second outputs observed in the benchmark.

  #### 2. VRAM Capacity &amp;amp; Batch Size Limits

  • The NVIDIA L4 has only 24 GB of VRAM. To serve a model like Gemma 4 26B, you are forced to use heavy quantization
  (such as FP4 or INT4), which degrades model accuracy.
  • Even with quantization, the L4 will have very little remaining VRAM for the KV cache, capping maximum concurrency
  (often failing or queueing heavily above 8–16 concurrent users).
  • The RTX 6000 (96 GB) allows loading the model with full fidelity (BF16) or using FP8 quantization to leave up to 70
  GB of VRAM dedicated entirely to the KV Cache. This allows for massive concurrent execution up to 256 users at a 16k
  context window with 100% success rate.

  #### 3. Core Compute and Throughput Scaling

  • With 568 Gen-4 Tensor Cores, the RTX 6000 provides massive parallel matrix multiplication capabilities. Under heavy
  concurrency, the GPU can continuous-batch hundreds of sequences together, keeping the Tensor cores highly saturated
  and hitting peak throughput numbers above 1,100 tokens/second.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
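
&lt;p&gt;The memory-bandwidth argument can be turned into a rough ceiling estimate. The sketch below rests on several assumptions: the bandwidth figures are the ones quoted in the analysis above (not independently verified), the "A4B" suffix is read as about 4 billion active parameters per token, and FP8 weights are taken as 1 byte per parameter. &lt;code&gt;decode_ceiling_tps&lt;/code&gt; is a hypothetical helper, and the model ignores KV cache reads and compute time.&lt;/p&gt;

```python
def decode_ceiling_tps(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Upper bound on single-stream decode speed if weight reads dominate.

    Each generated token needs the active weights streamed from memory once,
    so tokens per second cannot exceed bandwidth divided by bytes per token.
    """
    weight_gb_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb_per_token


# Assumed inputs: 4B active parameters, FP8 weights (1 byte each).
rtx_6000 = decode_ceiling_tps(960, active_params_billion=4, bytes_per_param=1)
l4 = decode_ceiling_tps(300, active_params_billion=4, bytes_per_param=1)

print(f"RTX 6000 ceiling: {rtx_6000:.0f} tokens/s")
print(f"L4 ceiling: {l4:.0f} tokens/s")
print(f"ratio: {rtx_6000 / l4:.1f}x")
```

&lt;p&gt;With these inputs the single-user throughput measured earlier (about 139 to 169 tokens/s) sits below the estimated ceiling, which is consistent with decoding being bandwidth-bound rather than compute-bound.&lt;/p&gt;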



&lt;p&gt;The comparison is visualized below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1g8w3ilnazk6b2m7mfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1g8w3ilnazk6b2m7mfl.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools needed to deploy and troubleshoot Cloud Run model deployments.&lt;/p&gt;

</description>
      <category>antigravitycli</category>
      <category>gemma4</category>
      <category>googlecloudrun</category>
      <category>blackwell6000</category>
    </item>
    <item>
      <title>31B Gemma 4 Deployment with NVIDIA Blackwell 6000, MCP, Cloud Run, and Antigravity CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 29 May 2026 21:53:40 +0000</pubDate>
      <link>https://dev.to/gde/31b-gemma-4-deployment-with-nvidia-blackwell-6000-mcp-cloud-run-and-antigravity-cli-27f7</link>
      <guid>https://dev.to/gde/31b-gemma-4-deployment-with-nvidia-blackwell-6000-mcp-cloud-run-and-antigravity-cli-27f7</guid>
      <description>&lt;p&gt;This article provides a step-by-step guide for deploying Gemma 4 to a GPU-enabled system hosted on Google Cloud Run. A suite of Python MCP tools is built to simplify management of the vLLM-hosted Gemma 4 deployment with Antigravity CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is this project trying to do?
&lt;/h4&gt;

&lt;p&gt;This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.&lt;/p&gt;

&lt;p&gt;This project is similar to a previous project that targeted GPU-hosted Gemma 4 instances on GCP:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/gemma-sre-self-hosted-vllm-infrastructure-agent-afda000c0943" rel="noopener noreferrer"&gt;Gemma-SRE: Self-Hosted vLLM Infrastructure Agent&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;Antigravity CLI is the successor to Gemini CLI, the terminal-driven, agent-assisted coding tool.&lt;/p&gt;

&lt;p&gt;Full details on installing Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing the Antigravity CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools in place, you can test the startup of Antigravity CLI.&lt;/p&gt;

&lt;p&gt;You will need to authenticate with a Google Cloud Project or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Full Installation Instructions
&lt;/h4&gt;

&lt;p&gt;The detailed installation instructions for Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python MCP Documentation
&lt;/h4&gt;

&lt;p&gt;The official GitHub Repo provides samples and documentation for getting started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting MCP development for model management is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Antigravity CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.&lt;/p&gt;

&lt;p&gt;This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemma4-tips
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-31B-6000-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-31B-6000-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID must be set for the various build scripts, so the set_env.sh script can be used to restore the environment after a timeout.&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Management Tool with MCP Stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features that the standard MCP libraries provide is abstracting various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same regardless of which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;The simplest transport that the SDK supports is the stdio (stdin/stdout) transport, which connects to a locally running process. Both the MCP client and the MCP server must run in the same environment.&lt;/p&gt;

&lt;p&gt;The connection over stdio will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize FastMCP server
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Self-Hosted vLLM DevOps Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
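
&lt;p&gt;For intuition about what travels over the pipe: the stdio transport exchanges JSON-RPC 2.0 messages as newline-delimited lines on the server's stdin and stdout. The SDK handles this framing for you. The sketch below only illustrates the idea with the standard library, using an in-memory buffer in place of a real process; &lt;code&gt;write_message&lt;/code&gt; and &lt;code&gt;read_message&lt;/code&gt; are hypothetical helper names, not SDK functions.&lt;/p&gt;

```python
import io
import json


def write_message(stream, message):
    """Write one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the line stays intact.
    stream.write(json.dumps(message, separators=(",", ":")) + "\n")
    stream.flush()


def read_message(stream):
    """Read one newline-delimited message, or None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    return json.loads(line)


# An in-memory buffer stands in for the server's stdin in this sketch.
pipe = io.StringIO()
write_message(pipe, {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
pipe.seek(0)

request = read_message(pipe)
print(request["method"])
```

&lt;p&gt;Because the framing is this simple, anything else the server prints to stdout would corrupt the stream, which is why log output in stdio servers usually goes to stderr.&lt;/p&gt;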



&lt;h4&gt;
  
  
  Running the Python Code
&lt;/h4&gt;

&lt;p&gt;First, switch to the directory containing the Python version of the MCP sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/gemma4-tips/gpu-31B-6000-devops-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the release version on the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;install
&lt;/span&gt;Processing ./.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project can also be linted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make lint
ruff check &lt;span class="nb"&gt;.&lt;/span&gt;
All checks passed!
ruff format &lt;span class="nt"&gt;--check&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
6 files already formatted
mypy &lt;span class="nb"&gt;.&lt;/span&gt;
Success: no issues found &lt;span class="k"&gt;in &lt;/span&gt;6 &lt;span class="nb"&gt;source &lt;/span&gt;files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test suite can be run as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make &lt;span class="nb"&gt;test
&lt;/span&gt;python test_agent.py
2026-05-29 16:36:41,800 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Model response: &lt;span class="s1"&gt;'Response from Gemma...'&lt;/span&gt;
.2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Querying model with stats with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Model response with stats: &lt;span class="nv"&gt;TTFT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s, &lt;span class="nv"&gt;TotalTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s
.......
&lt;span class="nt"&gt;----------------------------------------------------------------------&lt;/span&gt;
Ran 17 tests &lt;span class="k"&gt;in &lt;/span&gt;0.027s

OK
xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  MCP stdio Transport
&lt;/h4&gt;

&lt;p&gt;In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI mcp_config.json
&lt;/h4&gt;

&lt;p&gt;A sample MCP server configuration file is provided in the .agents directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gpu-31b-6000-devops-agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aisprint-491218"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"VLLM_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MODEL_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/mnt/models/gemma-4-31B-it"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
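
&lt;p&gt;A misspelled or missing environment key in this file tends to surface only later as a confusing tool error, so a quick pre-flight check can help. The following is a minimal sketch, assuming the four env keys shown in the sample are the ones the server needs; &lt;code&gt;missing_env_keys&lt;/code&gt; is a hypothetical helper, and the sample values are placeholders rather than real project settings.&lt;/p&gt;

```python
import json

# Environment keys that appear in the sample config above.
EXPECTED_ENV_KEYS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "VLLM_BASE_URL",
    "MODEL_NAME",
)


def missing_env_keys(config_text, server_name):
    """Return the expected env keys that are absent or empty for one server."""
    config = json.loads(config_text)
    server = config["mcpServers"][server_name]
    env = server.get("env", {})
    return [key for key in EXPECTED_ENV_KEYS if not env.get(key)]


# Placeholder values; substitute your own project, region, and service URL.
sample_config = json.dumps(
    {
        "mcpServers": {
            "example-agent": {
                "command": "python3",
                "args": ["/path/to/server.py"],
                "env": {
                    "GOOGLE_CLOUD_PROJECT": "my-project",
                    "GOOGLE_CLOUD_LOCATION": "us-central1",
                    "VLLM_BASE_URL": "https://example.invalid",
                },
            }
        }
    }
)

print(missing_env_keys(sample_config, "example-agent"))
```

&lt;p&gt;Here the sample deliberately omits MODEL_NAME, so the check reports that key as missing.&lt;/p&gt;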



&lt;h4&gt;
  
  
  Validation with Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;The final connection test uses Antigravity CLI as an MCP client with the Python code providing the MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
&amp;gt; ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-31b-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                                analyze_cloud_logging, +19 more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Getting Started with Gemma 4 on GPU
&lt;/h4&gt;

&lt;p&gt;The official vLLM repo also has Gemma 4 specific information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm/releases" rel="noopener noreferrer"&gt;Releases · vllm-project/vllm&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Lifecycle Management via MCP
&lt;/h4&gt;

&lt;p&gt;The MCP tools provide a complete suite of agent-oriented operations for managing a vLLM deployment on Cloud Run with GPU.&lt;/p&gt;

&lt;p&gt;Overview of the MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; list mcp tools&lt;/span&gt;

● gpu-31b-6000-devops-agent/get_help(Get help details from agent) (ctrl+o to expand)

  Here is the list of available Model Context Protocol (MCP) tools provided by the gpu-31b-6000-devops-agent server:

  ### 🐳 Infrastructure &amp;amp; Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.

  ### 📦 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Returns instructions to copy a model from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Returns instructions to download a model from Hugging Face and upload it
  to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .

  ### 📊 Monitoring &amp;amp; Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  ### 📈 Performance &amp;amp; Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  ### 💬 Interaction &amp;amp; Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
  • get_help : Retrieves the help menu, active environment configuration, and tool mappings.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
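
&lt;p&gt;As an illustration of what a tool such as get_model_details might do behind the scenes, the sketch below reads the OpenAI-compatible &lt;code&gt;/v1/models&lt;/code&gt; endpoint that vLLM exposes and extracts the served model names. It is not the project's actual implementation: &lt;code&gt;fetch_models&lt;/code&gt; and &lt;code&gt;summarize_models&lt;/code&gt; are hypothetical names, the endpoint is assumed to be reachable without an identity token, and the demo runs against a hand-written sample payload instead of a live service.&lt;/p&gt;

```python
import json
import urllib.request


def fetch_models(base_url, timeout_s=10.0):
    """GET base_url/v1/models and return the decoded JSON payload."""
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.load(response)


def summarize_models(payload):
    """Keep only the served model id and owner from a /v1/models payload."""
    return [
        {"id": model.get("id"), "owned_by": model.get("owned_by")}
        for model in payload.get("data", [])
    ]


# Hand-written sample in the shape of an OpenAI-style model list.
sample_payload = {
    "object": "list",
    "data": [
        {"id": "gpu-31b-6000-devops-agent", "object": "model", "owned_by": "vllm"}
    ],
}

print(summarize_models(sample_payload))
```

&lt;p&gt;Against a live deployment you would pass the service URL to &lt;code&gt;fetch_models&lt;/code&gt; and feed the result to &lt;code&gt;summarize_models&lt;/code&gt;. A service that requires authentication would also need an Authorization header, which this sketch leaves out.&lt;/p&gt;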



&lt;h4&gt;
  
  
  Deploy The Model
&lt;/h4&gt;

&lt;p&gt;The MCP server tools are used to deploy the model to Cloud Run with GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  The compilation step and warm-up run have successfully completed:

  • torch.compile took 67.21 seconds.
  • FlashInfer JIT autotuning completed.
  • CUDA graph memory profiling is complete (5/5 CUDA graphs captured).

  The engine is now starting up the OpenAI API router. Port 8080 will be open and the revision will be marked active
  within seconds.

● Bash(make query PROMPT="Hello Gemma 4!") (ctrl+o to expand)

  I have initiated a test query ( make query ) against the new endpoint to verify the vLLM serving stack is fully
  functional and responding to prompts. I'll notify you as soon as the response arrives.

  The deployment is fully complete and functional!

  Here is the final summary:

  • Model Copy: All weights and tokenizer configurations for google/gemma-4-31B-it (totaling 58.28 GiB) were
  successfully transferred from Hugging Face directly to your Google Cloud Storage bucket
  ( gs://aisprint-491218-bucket/gemma-4-31B-it/ ).
  • Deployment: The Cloud Run GPU service was successfully deployed under the name gpu-31b-6000-devops-agent pointing
  to the newly uploaded 31B model folder.
  • Initialization: The vLLM serving stack initialized, loaded the 31B model using the Model Streamer, successfully
  performed torch.compile graph optimization, and compiled CUDA graphs.
  • Verification: A test query was sent to the deployed Cloud Run endpoint and returned a successful completions
  response structure showing full connectivity and model execution.

  The endpoint is live and serving traffic at:
   https://gpu-31b-6000-devops-agent-289270257791.us-central1.run.app 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checking System Status
&lt;/h4&gt;

&lt;p&gt;The status can be checked with an MCP tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; get_system_status&lt;/span&gt;

● gpu-31b-6000-devops-agent/get_system_status(Retrieve system status details) (ctrl+o to expand)

  The system status has been successfully retrieved:

  ### 🌀 GPU Cloud Run System Status

  • vLLM Health: 🟢 Online (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)
  • Cloud Run Service Status: 🟢 Ready

  Next Step: You can use the query_gemma4 tool to interact with the model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cross Check The Deployed Model
&lt;/h4&gt;

&lt;p&gt;Once the model is running, the MCP tool can verify it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; verify_model_health&lt;/span&gt;

● gpu-31b-6000-devops-agent/verify_model_health(Verify model health status) (ctrl+o to expand)

  The model health check has passed successfully:

  • Status: ✅ PASSED
  • Model: gpu-31b-6000-devops-agent 
  • Response: "Yes, I am working! How can I help you today?..."
  • Latency: 0.44 seconds

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
&lt;span class="gt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;details&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;●&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;gpu&lt;/span&gt;&lt;span class="mi"&gt;-31&lt;/span&gt;&lt;span class="err"&gt;b&lt;/span&gt;&lt;span class="mi"&gt;-6000&lt;/span&gt;&lt;span class="err"&gt;-devops-agent/get_model_details(Retrieve&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;details&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;about&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;serving&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;model)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(ctrl+o&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;expand)&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;details&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;have&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;been&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;retrieved&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;successfully:&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;###&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;🧩&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Details&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(vLLM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Endpoint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;https://gpu&lt;/span&gt;&lt;span class="mi"&gt;-31&lt;/span&gt;&lt;span class="err"&gt;B&lt;/span&gt;&lt;span class="mi"&gt;-6000&lt;/span&gt;&lt;span class="err"&gt;-devops-agent&lt;/span&gt;&lt;span class="mi"&gt;-289270257791&lt;/span&gt;&lt;span class="err"&gt;.us-central&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;.run.app)&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;Model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Information&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;/models&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;):&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpu-31b-6000-devops-agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"owned_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vllm"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;Health&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Status&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/health&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;):&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;•&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Status:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Healthy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;✅&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; get endpoint&lt;/span&gt;

● gpu-31b-6000-devops-agent/get_endpoint(Retrieve the endpoint URL of the serving stack) (ctrl+o to expand)

  The Cloud Run vLLM endpoint is online at:

  🟢 Endpoint URL: https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the Cloud Run deployment status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; show the cloud run docker status

● gpu-31b-6000-devops-agent/status_vllm&lt;span class="o"&gt;(&lt;/span&gt;Get status of vLLM Cloud Run service&lt;span class="o"&gt;)&lt;/span&gt;
● Search&lt;span class="o"&gt;(&lt;/span&gt;Search &lt;span class="k"&gt;for &lt;/span&gt;status_vllm&lt;span class="o"&gt;)&lt;/span&gt;
● Read&lt;span class="o"&gt;(&lt;/span&gt;/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py&lt;span class="o"&gt;)&lt;/span&gt;
● Bash&lt;span class="o"&gt;(&lt;/span&gt;gcloud run services describe gpu-31b-6000-devops-agent &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aisprint-491218 &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-c...&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;ctrl+o to
&lt;span class="nb"&gt;expand&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  The container &lt;span class="o"&gt;(&lt;/span&gt;Docker&lt;span class="o"&gt;)&lt;/span&gt; configuration and status &lt;span class="k"&gt;for &lt;/span&gt;the Cloud Run service gpu-31b-6000-devops-agent are:

  • Docker Image: vllm/vllm-openai:latest 
  • Port: 8080 &lt;span class="o"&gt;(&lt;/span&gt;HTTP&lt;span class="o"&gt;)&lt;/span&gt;
  • Resource Limits:
      • CPU: 20 Cores
      • Memory: 80 GiB
      • GPU: 1 NVIDIA GPU &lt;span class="o"&gt;(&lt;/span&gt;RTX Pro 6000&lt;span class="o"&gt;)&lt;/span&gt;
  • Volume Mounts: /mnt/models mounted to model-volume 
  • Startup Probe: Starts checking after 240 seconds, with a 15-second period and 120 failure threshold &lt;span class="o"&gt;(&lt;/span&gt;designed to
  accommodate large model loading &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
  • Entrypoint / Args:
    bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"vllm serve /mnt/models/gemma-4-31B-it &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --served-model-name gpu-31b-6000-devops-agent &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-log-requests &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-chunked-prefill &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-prefix-caching &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --generation-config auto &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --enable-auto-tool-choice &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --tool-call-parser gemma4 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --reasoning-parser gemma4 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --dtype bfloat16 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --quantization fp8 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --kv-cache-dtype fp8 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --max-num-seqs 8 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --gpu-memory-utilization 0.95 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --tensor-parallel-size 1 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --load-format runai_streamer &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --port 8080 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      --max-model-len 32767"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
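
&lt;p&gt;The startup probe numbers in the output above imply a generous loading budget. As a rough sketch (assuming the window is simply the initial delay plus one period per allowed failure, and ignoring the per-probe timeout), the arithmetic looks like this. The function name is hypothetical.&lt;/p&gt;

```python
def startup_window_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Rough upper bound on how long a container may take to pass its startup probe."""
    return initial_delay + period * failure_threshold


# Values quoted in the status output: 240 s delay, 15 s period, 120 failures allowed.
window = startup_window_seconds(initial_delay=240, period=15, failure_threshold=120)
print(f"{window} s (about {window / 60:.0f} min)")
```

&lt;p&gt;Under these assumptions the container has roughly half an hour to stream the weights and finish graph compilation before it is marked as failed.&lt;/p&gt;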



&lt;h4&gt;
  
  
  Check the Cloud Run Service
&lt;/h4&gt;

&lt;p&gt;Once deployed, the model is visible as a standard Cloud Run service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjvug8ei9lokmkaltq2j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjvug8ei9lokmkaltq2j.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Review the Model
&lt;/h4&gt;

&lt;p&gt;Check the model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; what are the model details?&lt;/span&gt;

● gpu-31b-6000-devops-agent/get_model_details(Retrieve details about the serving model) (ctrl+o to expand)

  The credentials have been refreshed successfully! Here are the model details:

  ### 🧩 Model Details (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)

  Model Information ( /v1/models ):&lt;span class="sb"&gt;

    [
      {
        "id": "gpu-31b-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

&lt;/span&gt;  Health Status ( /health ):

  • Status: Healthy ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
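
&lt;p&gt;If you want to consume the &lt;code&gt;/v1/models&lt;/code&gt; output programmatically, a small parser is enough. This sketch accepts either the bare list printed in the transcript or an OpenAI-style envelope with a &lt;code&gt;data&lt;/code&gt; field. The function name is hypothetical and the sample payload is copied from the output above.&lt;/p&gt;

```python
import json


def served_model_ids(payload: str) -> list[str]:
    """Extract model ids from a /v1/models response body.

    Accepts a bare JSON list or an envelope of the form
    {"object": "list", "data": [...]}.
    """
    parsed = json.loads(payload)
    entries = parsed["data"] if isinstance(parsed, dict) else parsed
    return [entry["id"] for entry in entries]


sample = '[{"id": "gpu-31b-6000-devops-agent", "object": "model", "owned_by": "vllm"}]'
print(served_model_ids(sample))
```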



&lt;p&gt;Once everything is online, you can send queries to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; query_gemma4_with_stats what is gemma?&lt;/span&gt;

● gpu-31b-6000-devops-agent/query_gemma4_with_stats(Query model for Gemma info and return statistics) (ctrl+o to expand)

  Here are the performance statistics and the model's response:

  ### 📊 Performance Stats

  • Model: gpu-31b-6000-devops-agent 
  • Time to First Token (TTFT): 0.093s 
  • Total Generation Time: 15.658s 
  • Tokens per Second: 39.13 tokens/s 
  • Total Tokens (approx.): 609 

  ### 💬 Model Response

  Gemma is a family of lightweight, state-of-the-art open-weights large language models (LLMs) developed by Google     
  DeepMind.

  Think of Gemma as the "smaller, open sibling" of Gemini, Google's most powerful AI. While Gemini is a closed system
  available via API or chat interface, Gemma is designed to be downloaded and run on your own hardware.

  Here is a detailed breakdown of what makes Gemma unique:

  ### 1. "Open-Weights" vs. "Open Source"

  It is important to note that Gemma is open-weights, not fully open-source.

  • Open-Source usually means the entire training data and code are public.
  • Open-Weights means Google provides the "brain" (the trained parameters) for free, allowing developers to run it
  locally and fine-tune it for specific tasks, but the exact training data remains proprietary.

  ### 2. Relationship to Gemini

  Gemma was built using the same research, technology, and infrastructure used to create the Gemini models. This means
  it inherits much of the reasoning, coding, and mathematics capability of the larger Gemini models, but in a package
  that is small enough to fit on a personal laptop or a workstation.

  ### 3. Key Versions

  Gemma typically comes in different sizes to suit different hardware:

  • Gemma 2B: Extremely small; can run on a mobile phone or a basic laptop.
  • Gemma 9B / 27B: Mid-sized models that offer a balance between high intelligence and performance. The 27B version is
  designed to compete with much larger models (like Llama 3) while remaining efficient.

  ### 4. What can you do with Gemma?

  Because you can host Gemma on your own machine, it is popular for:

  • Privacy: Processing sensitive data without sending it to a cloud server.
  • Customization (Fine-Tuning): A developer can take Gemma and "teach" it a specific medical dataset or a company's
  internal documentation to make it an expert in a niche field.
  • Research: Academics use it to study how LLMs work without needing a supercomputer.
  • Edge Computing: Running AI locally on devices that don't have a constant internet connection.

  ### 5. How does it compare to others?

  Gemma is Google's direct competitor to Meta's Llama (e.g., Llama 3) and Mistral AI's Mistral/Mixtral models. All
  three are fighting to be the "standard" model that developers use to build their own AI applications.

  In short: Gemma is Google's high-performance, portable AI model that allows developers to build and run their own AI 
  apps privately and locally.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
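
&lt;p&gt;The reported statistics can be sanity-checked by hand. The sketch below assumes that tokens per second is measured over the decode phase only, that is, total generation time minus time to first token. That is an assumption about how the tool computes the figure, but it reproduces the reported 39.13 tokens/s from the other three numbers.&lt;/p&gt;

```python
def decode_tokens_per_second(total_tokens: int, total_seconds: float, ttft_seconds: float) -> float:
    """Tokens per second measured after the first token has arrived."""
    decode_seconds = total_seconds - ttft_seconds
    if not decode_seconds > 0:
        raise ValueError("total time must exceed time to first token")
    return total_tokens / decode_seconds


# Figures from the stats block: 609 tokens, 15.658 s total, 0.093 s TTFT.
rate = decode_tokens_per_second(609, 15.658, 0.093)
print(f"{rate:.2f} tokens/s")
```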



&lt;h4&gt;
  
  
  Benchmark the Model
&lt;/h4&gt;

&lt;p&gt;The model was benchmarked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

  The benchmark sweep across 1 to 256 concurrent users and 8 to 16,384 context tokens has successfully completed.

  ### 📂 Generated Files

  • Performance Report: benchmark_report.md
  • Raw CSV Results: matrix_benchmark_results.csv
  • Performance Chart: rtx_6000_benchmark.png
  • Conversation Artifact: benchmark_report.md

  ### 📈 Key Results Summary

  • Continuous Batching Efficiency: The vLLM engine scales exceptionally well under load. Under maximum load (256      
  concurrent users with 16,384 context window), the token generation rate peaked at ~344.96 tokens/sec.
  • Latency Profile: Latency is highly predictable, starting at sub-second/single-digit latency for lower concurrency
  profiles, scaling linearly to approximately 41.43s for 256 concurrent users at 16,384 tokens.
  • Warmup/Cold Starts: During the transition to the 512 token context sweep, a container restart and graph
  compilation occurred, resulting in temporarily lower throughput before graph caching took over and returned the
  system to optimal speed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analysis is reviewed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  ### 🔑 Key Takeaways
&lt;span class="p"&gt;
  1.&lt;/span&gt; Continuous Batching Efficiency ($C=16$ Sweet Spot):
  Scaling concurrency from 1 to 16 increases total token throughput by almost 8x (jumping from ~38 tok/s to ~300
  tok/s) while average latency only increases slightly (from 3.32s to 5.17s).
&lt;span class="p"&gt;  2.&lt;/span&gt; Throughput Saturation:
  The RTX 6000 Pro's token throughput caps out at ~310–345 tokens/sec. Pushing concurrency past 32 up to 256 does
  not yield further throughput gains, but instead results in linear queuing latency (latency reaches up to ~54.87s).
&lt;span class="p"&gt;  3.&lt;/span&gt; Prefix Caching Phenomenon (16,384 tokens):
  Under high load, large context windows (16k tokens) actually completed faster than short context windows (avg latency
  of 41.43s vs 54.87s at 256 concurrency). This is due to vLLM's automatic prefix caching which evaluates the large
  prompt KV cache once and shares it, reducing prefill time to $O(1)$ for subsequent concurrent requests.

  ### 🛠️ SRE Configuration Recommendations

  • Cloud Run Concurrency: Limit instance concurrency to --concurrency=16 or --concurrency=32 to trigger auto-
  scaling to new instances rather than queuing on a single saturated GPU.
  • Prefix Caching: Always run with the --enable-prefix-caching flag enabled to optimize shared prompt contexts (e.g.
  system instructions, repeated log dumps).
  • Startup Delays: Set startup timeouts to at least 240 seconds to accommodate container cold-starts, safetensors
  downloading, and torch.compile graph optimization.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is visualized:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4e7402p5sjt92r9qyh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4e7402p5sjt92r9qyh3.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;
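
&lt;p&gt;To make the "sweet spot" claim concrete, the quoted figures can be turned into ratios. This is only arithmetic on the numbers reported above (about 38 and 300 tokens/s, 3.32 s and 5.17 s); the variable names are illustrative.&lt;/p&gt;

```python
# Figures quoted in the takeaways for concurrency 1 and concurrency 16.
throughput_c1, throughput_c16 = 38.0, 300.0   # total tokens per second
latency_c1, latency_c16 = 3.32, 5.17          # average seconds per request

throughput_gain = throughput_c16 / throughput_c1
latency_growth = latency_c16 / latency_c1
per_request_c16 = throughput_c16 / 16         # tokens per second seen by each request

print(f"throughput gain: {throughput_gain:.1f}x")
print(f"latency growth: {latency_growth:.2f}x")
print(f"per-request rate at C=16: {per_request_c16:.2f} tokens/s")
```

&lt;p&gt;Total throughput rises almost eightfold while latency grows by a little over half, so each individual request is slower but the GPU is doing far more useful work overall.&lt;/p&gt;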

&lt;h4&gt;
  
  
  Compare to TPU Deployment
&lt;/h4&gt;

&lt;p&gt;The same model was deployed on a v6e-4 TPU for comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  I have generated a side-by-side performance comparison between the RTX 6000 Pro GPU and Cloud TPU v6e-4.

  📂 Detailed Comparison Report: gpu_tpu_comparison_report.md
  🖼️ Visual Comparison Chart: gpu_tpu_comparison.png

  ### 📊 Key Performance Differences
&lt;span class="p"&gt;
  1.&lt;/span&gt; Peak Throughput (TPU Dominance):
      • For short contexts (8 tokens), TPU v6e-4 reaches ~1,379 tokens/sec compared to the GPU's ~308 tokens/sec (4.5x 
      faster).
      • For long contexts (16,384 tokens), TPU v6e-4 achieves ~3,206 tokens/sec compared to the GPU's ~344 tokens/sec
      (9.3x faster).
&lt;span class="p"&gt;  2.&lt;/span&gt; Latency (Sub-Second Responses):
      • TPU latency stays extremely low under medium load (averaging 0.46s at concurrency 16), whereas the GPU averages
&lt;span class="p"&gt;      5.&lt;/span&gt;17s for the same level.
&lt;span class="p"&gt;  3.&lt;/span&gt; Architectural Fit:
      • NVIDIA RTX 6000 Pro GPU (Cloud Run): Perfect for cost-sensitive, low-to-medium concurrency apps, or workflows
      with high idle periods (leveraging Cloud Run's scale-to-zero model).
      • Cloud TPU v6e-4: Essential for high-traffic enterprise gateways, real-time streaming, and massive batch log-
      parsing operations where high context and high concurrency are standard.

📂 Architectural Deep Dive Report: gpu_tpu_deep_dive.md

  ### 🔬 Core Hardware Findings
&lt;span class="p"&gt;
  1.&lt;/span&gt; Memory Bandwidth Bottleneck (Decode Phase):
  During token generation (decode), execution is entirely memory-bandwidth bound because the model parameters must be
  read from VRAM for every single token. The TPU v6e-4's HBM2e memory supplies ~6.4 TB/s of bandwidth compared to the
  GPU's 960 GB/s (GDDR6). This 6.6x advantage is why the TPU handles high concurrency without saturating.
&lt;span class="p"&gt;  2.&lt;/span&gt; Distributed Processing (Tensor Parallelism):
  The TPU v6e-4 config splits the 31B model across 4 chips ( --tensor-parallel-size 4 ) communicating over an ultra-low
  latency Inter-Chip Interconnect (ICI). The RTX 6000 Pro runs on a single card ( --tensor-parallel-size 1 ),
  processing all weights on a single GPU core.
&lt;span class="p"&gt;  3.&lt;/span&gt; Compiler Efficiency (XLA vs. CUDA):
  TPUs compile the model graph using XLA (Accelerated Linear Algebra), which optimizes matrix operations and minimizes
  operator execution overhead. The GPU relies on CUDA Graphs and torch.compile which, while fast, has higher startup
  graph compilation times and launch latencies.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
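
&lt;p&gt;The speedup multipliers in the comparison follow directly from the quoted throughput and latency numbers. A short check, using only the figures reported by the agent (the dictionary layout is just for illustration):&lt;/p&gt;

```python
# (TPU value, GPU value) pairs quoted in the comparison above.
throughput_tokens_per_s = {
    "short context (8 tokens)": (1379, 308),
    "long context (16,384 tokens)": (3206, 344),
}
latency_seconds_c16 = (0.46, 5.17)  # (TPU, GPU) average latency at concurrency 16

speedups = {name: tpu / gpu for name, (tpu, gpu) in throughput_tokens_per_s.items()}
latency_ratio = latency_seconds_c16[1] / latency_seconds_c16[0]

for name, value in speedups.items():
    print(f"{name}: TPU is {value:.1f}x faster")
print(f"GPU latency is {latency_ratio:.1f}x the TPU latency at concurrency 16")
```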



&lt;p&gt;The results were graphed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6telapfxjf6n799b7uy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6telapfxjf6n799b7uy.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and recommendations generated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  📂 Recommendations Document: production_recommendations.md

  ### 🚀 Summary of Key Actions
&lt;span class="p"&gt;
  1.&lt;/span&gt; Hardware Strategy:
      • Use RTX 6000 Pro (Cloud Run) for SRE troubleshooting tools, development, and low-traffic applications where
      scale-to-zero saves significant cost.
      • Use Cloud TPU v6e-4 for high-volume enterprise APIs or systems handling massive context windows (16,384+
      tokens) at high concurrency.
&lt;span class="p"&gt;  2.&lt;/span&gt; Cloud Run Tuning:
      • Set --concurrency=16 or --concurrency=32 to avoid saturating a single GPU, directing excess traffic to auto-
      scaled new instances.
      • Configure --timeout=300 and TCP startup probes to at least 240 seconds to allow for weight streaming and
      graph optimization.
&lt;span class="p"&gt;  3.&lt;/span&gt; Prefix Caching Alignment:
      • Keep --enable-prefix-caching active.
      • Organize prompts from static (system parameters, database schemas) to dynamic (user query, timestamped logs) to
      maximize prefix cache hit ratios.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
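
&lt;p&gt;The prompt-ordering advice can be sketched in a few lines. The example below is a hypothetical prompt builder that puts the unchanging sections first and the per-request sections last, so consecutive requests share as long a prefix as possible. Cache reuse in the serving engine happens on token prefixes, so the character-level overlap measured here is only a proxy.&lt;/p&gt;

```python
import os


def build_prompt(system_rules: str, schema: str, user_query: str, log_excerpt: str) -> str:
    """Place static sections first and per-request sections last."""
    static_prefix = f"{system_rules}\n\n{schema}\n\n"
    dynamic_suffix = f"Question: {user_query}\n\nLogs:\n{log_excerpt}"
    return static_prefix + dynamic_suffix


def shared_prefix_length(first: str, second: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([first, second]))


first = build_prompt("You are an SRE assistant.", "TABLE incidents(id, ts)", "Why did pod A restart?", "12:00 OOMKilled")
second = build_prompt("You are an SRE assistant.", "TABLE incidents(id, ts)", "What changed today?", "13:05 deploy ok")
print(shared_prefix_length(first, second))
```

&lt;p&gt;If the timestamped logs were placed at the top instead, the shared prefix would shrink to almost nothing and each request would pay the full prefill cost.&lt;/p&gt;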



&lt;p&gt;Finally, a basic cost analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  I have performed a cost-per-token analysis comparing the NVIDIA RTX 6000 Pro GPU (Cloud Run) to the Cloud TPU v6e-4
  based on GCP's resource pricing models.

  📂 Detailed Cost Analysis Report: cost_per_token_analysis.md

  ### 📊 Cost-per-Million Tokens Comparison (at Concurrency = 16)
&lt;span class="p"&gt;
  1.&lt;/span&gt; Short Context (8 Input Tokens):
      • RTX 6000 Pro GPU: $3.49 per million tokens
      • Cloud TPU v6e-4: $4.99 per million tokens
      • Verdict: GPU is 30% cheaper because the machine-hour cost is lower ($3.76/hr vs $5.50/hr).
&lt;span class="p"&gt;  2.&lt;/span&gt; Long Context (16,384 Input Tokens):
      • RTX 6000 Pro GPU: $4.24 per million tokens
      • Cloud TPU v6e-4: $0.53 per million tokens
      • Verdict: TPU is 8x cheaper because the TPU v6e-4 leverages HBM2e memory bandwidth to maintain massive
      throughput (2,892 tokens/sec vs 246 tokens/sec on GPU).

  ### 💡 SRE Recommendation

  • Spiky/Ad-Hoc Workloads (e.g. On-demand log diagnostics): Use the GPU on Cloud Run. It supports scale-to-zero,
  meaning you only pay the $3.76/hr rate while processing requests, saving substantial money over a TPU VM which must
  run constantly (charging $132/day even when idle).
  • Continuous/High-Throughput Production Workloads: Use the Cloud TPU v6e-4. Running TPU instances continuously is
  significantly more cost-efficient for heavy text processing and long-context analysis.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
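
&lt;p&gt;The cost figures can be reproduced, to within a cent, from the hourly rates and throughputs quoted above. The sketch assumes the simplest possible model: the instance is fully busy for the whole hour and cost is the hourly rate divided by tokens produced. The function name is hypothetical.&lt;/p&gt;

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Long-context figures quoted in the analysis.
gpu_long = cost_per_million_tokens(3.76, 246)
tpu_long = cost_per_million_tokens(5.50, 2892)
tpu_idle_per_day = 5.50 * 24

print(f"GPU: ${gpu_long:.2f}  TPU: ${tpu_long:.2f}  TPU idle per day: ${tpu_idle_per_day:.0f}")
```

&lt;p&gt;The same function also shows why scale-to-zero matters: the per-token cost only holds while the hardware is busy, whereas an always-on instance keeps accruing the hourly rate when idle.&lt;/p&gt;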



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot Cloud Run model deployments.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>mcps</category>
      <category>runcloud</category>
      <category>antigravitycli</category>
    </item>
    <item>
      <title>Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 29 May 2026 16:47:25 +0000</pubDate>
      <link>https://dev.to/gde/gemma-4-benchmarking-nvidia-blackwell-rtx-6000-vs-l4-on-google-cloud-run-1m4p</link>
      <guid>https://dev.to/gde/gemma-4-benchmarking-nvidia-blackwell-rtx-6000-vs-l4-on-google-cloud-run-1m4p</guid>
      <description>&lt;p&gt;This article provides a step-by-step guide for deploying Gemma 4 to a GPU-enabled Google Cloud Run service. A suite of Python MCP tools is built to simplify management of the vLLM-hosted Gemma 4 deployment with Antigravity CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprgzxj1uzsqeaii6cvvv.jpeg" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is this project trying to do?
&lt;/h4&gt;

&lt;p&gt;This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.&lt;/p&gt;

&lt;p&gt;This project is similar to a previous project that targeted GPU-hosted Gemma 4 instances on GCP:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/gemma-sre-self-hosted-vllm-infrastructure-agent-afda000c0943" rel="noopener noreferrer"&gt;Gemma-SRE: Self-Hosted vLLM Infrastructure Agent&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;Antigravity CLI is the successor to Gemini CLI, the terminal-driven, agent-assisted coding tool.&lt;/p&gt;

&lt;p&gt;Full details on installing Antigravity CLI are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/getting-started-with-antigravity-cli-26c5da90951f" rel="noopener noreferrer"&gt;Getting Started with Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing the Antigravity CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools in place, you can test the startup of Antigravity CLI.&lt;/p&gt;

&lt;p&gt;You will need to authenticate with a Google Cloud Project or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mvg4a6tnyt4j8666lb1.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python MCP Documentation
&lt;/h4&gt;

&lt;p&gt;The official GitHub Repo provides samples and documentation for getting started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting MCP development for model management is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Antigravity CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.&lt;/p&gt;

&lt;p&gt;This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set Up the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemma4-tips
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-6000-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-6000-devops-agent
&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables such as PROJECT_ID must be set for the various build scripts, so the set_env.sh script can be used to reset the environment if your session times out.&lt;/p&gt;
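
&lt;p&gt;As a quick sanity check before running the build scripts, a small helper along these lines can report which variables are still unset. This is a minimal sketch and not part of the repository: the function name missing_vars is hypothetical, and the list of required names (PROJECT_ID from the text above, plus two keys that appear in the sample mcp_config.json) is an assumption about what the scripts need.&lt;/p&gt;

```python
import os

# Assumed list: PROJECT_ID is named in the article; the other two mirror
# keys of the sample mcp_config.json and may not all be required.
REQUIRED_VARS = ["PROJECT_ID", "GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION"]


def missing_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(REQUIRED_VARS)
    if missing:
        print("Unset variables:", ", ".join(missing))
        print("Try: source set_env.sh")
    else:
        print("Environment looks complete.")
```

&lt;p&gt;Passing the environment as a plain dictionary keeps the check easy to exercise without touching the real shell session.&lt;/p&gt;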

&lt;h4&gt;
  
  
  Model Management Tool with MCP Stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features of the standard MCP libraries is that they abstract the various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same regardless of which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;The simplest transport that the SDK supports is the stdio (stdin/stdout) transport, which connects to a locally running process. Both the MCP client and the MCP server must be running in the same environment.&lt;/p&gt;

&lt;p&gt;The connection over stdio will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
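&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;  &lt;span class="c1"&gt;# official MCP Python SDK (assumed)
&lt;/span&gt;&lt;span class="c1"&gt;# Initialize FastMCP server
&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Self-Hosted vLLM DevOps Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;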
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
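
&lt;p&gt;To make the stdio transport less abstract, the sketch below shows the wire format it relies on: JSON-RPC 2.0 messages written one per line, with no embedded newlines. The SDK handles this framing for you, so this is only an illustration. The helper names encode_message and read_messages are hypothetical, and an in-memory stream stands in for the real stdin pipe.&lt;/p&gt;

```python
import io
import json


def encode_message(payload):
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the line stays intact.
    return json.dumps(payload, separators=(",", ":")) + "\n"


def read_messages(stream):
    """Yield decoded JSON-RPC messages from a text stream, one per line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# A client request asking the server to list its tools.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# io.StringIO stands in for the server's stdin in this sketch.
pipe = io.StringIO(encode_message(request))
received = list(read_messages(pipe))
print(received[0]["method"])
```

&lt;p&gt;Because the framing is line based, anything else the server prints to stdout would corrupt the stream, which is why logging in a stdio server normally goes to stderr.&lt;/p&gt;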



&lt;h4&gt;
  
  
  Running the Python Code
&lt;/h4&gt;

&lt;p&gt;First, switch to the directory with the Python version of the MCP sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ~/gemma4-tips/gpu-6000-devops-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the release version on the local system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;install&lt;/span&gt;
&lt;span class="err"&gt;Processing&lt;/span&gt; &lt;span class="err"&gt;./.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project can also be linted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-6000-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make lint
ruff check &lt;span class="nb"&gt;.&lt;/span&gt;
All checks passed!
ruff format &lt;span class="nt"&gt;--check&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
6 files already formatted
mypy &lt;span class="nb"&gt;.&lt;/span&gt;
Success: no issues found &lt;span class="k"&gt;in &lt;/span&gt;6 &lt;span class="nb"&gt;source &lt;/span&gt;files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a test run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemma4-tips/gpu-6000-devops-agent&lt;span class="nv"&gt;$ &lt;/span&gt;make &lt;span class="nb"&gt;test
&lt;/span&gt;python test_agent.py
2026-05-28 17:33:37,347 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-28 17:33:37,388 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-28 17:33:37,388 - vllm-devops-agent - INFO - Model response: &lt;span class="s1"&gt;'Response from Gemma...'&lt;/span&gt;
.2026-05-28 17:33:37,389 - vllm-devops-agent - INFO - Querying model with stats with prompt: &lt;span class="s1"&gt;'Hello...'&lt;/span&gt;
2026-05-28 17:33:37,389 - vllm-devops-agent - INFO - Model response with stats: &lt;span class="nv"&gt;TTFT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s, &lt;span class="nv"&gt;TotalTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.000s
.......
&lt;span class="nt"&gt;----------------------------------------------------------------------&lt;/span&gt;
Ran 17 tests &lt;span class="k"&gt;in &lt;/span&gt;0.022s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  MCP stdio Transport
&lt;/h4&gt;

&lt;p&gt;One of the key features of MCP is that it abstracts the various transport methods.&lt;/p&gt;

&lt;p&gt;The high-level MCP tool implementation is the same regardless of which low-level transport the MCP client uses to connect to an MCP server.&lt;/p&gt;

&lt;p&gt;In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Antigravity CLI mcp_config.json
&lt;/h4&gt;

&lt;p&gt;A sample MCP server configuration file (mcp_config.json) is provided in the .agents directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gpu-6000-devops-agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/home/xbill/gemma4-tips/gpu-6000-devops-agent/server.py"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_PROJECT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aisprint-491218"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-central1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"VLLM_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://gpu-6000-devops-agent-289270257791.us-central1.run.app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MODEL_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/mnt/models/gemma-4-E4B-it"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
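
&lt;p&gt;Since a malformed entry in this file usually shows up only as a server that fails to start, it can help to check the file's shape first. The sketch below is an illustration under assumptions: the checks mirror the structure of the sample above (a string command, a list of args, string-valued env entries) rather than any schema published for Antigravity CLI, and the function name check_config is hypothetical.&lt;/p&gt;

```python
import json

# Copy of the sample configuration from the article, trimmed to two env keys.
SAMPLE = """
{
  "mcpServers": {
    "gpu-6000-devops-agent": {
      "command": "python3",
      "args": ["/home/xbill/gemma4-tips/gpu-6000-devops-agent/server.py"],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-central1"
      }
    }
  }
}
"""


def check_config(text):
    """Return a list of human-readable problems found in an MCP config."""
    problems = []
    servers = json.loads(text).get("mcpServers", {})
    if not servers:
        problems.append("no entries under mcpServers")
    for name, entry in servers.items():
        if not isinstance(entry.get("command"), str):
            problems.append(f"{name}: command must be a string")
        if not isinstance(entry.get("args", []), list):
            problems.append(f"{name}: args must be a list")
        for key, value in entry.get("env", {}).items():
            if not isinstance(value, str):
                problems.append(f"{name}: env {key} must be a string")
    return problems


print(check_config(SAMPLE))
```

&lt;p&gt;An empty list means the file matches the shape of the sample. It does not confirm that the path in args exists or that the endpoint is reachable.&lt;/p&gt;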



&lt;h4&gt;
  
  
  Validation with Antigravity CLI
&lt;/h4&gt;

&lt;p&gt;The final connection test uses Antigravity CLI as an MCP client with the Python code providing the MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
&amp;gt; ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                            analyze_cloud_logging, +20 more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Getting Started with Gemma 4 on GPU
&lt;/h4&gt;

&lt;p&gt;Cloud Run GPU deployments are subject to quota limits, which can be checked from the Antigravity CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; check cloud run quota in us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will return the current quotas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;The Cloud Run ( run.googleapis.com ) GPU quotas for us-central1 in your project ( aisprint-491218 ) are as follows:

  ### 1. NVIDIA RTX Pro 6000 GPU Allocation

  • Metric Name: Total Nvidia RTX Pro 6000 GPU allocation without zonal redundancy 
  • Metric Identifier: run.googleapis.com/nvidia_rtx_pro_6000_gpu_allocation_no_zonal_redundancy 
  • Effective Limit: 3000 in us-central1 (due to a producer override set to 3000 ).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use the Antigravity CLI to research a quota increase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; request RTX Pro 6000 GPUs cloud run quota of 3 in us-central1 with zonal redundancy&lt;/span&gt;

● Bash(gcloud alpha services quota --help || gcloud consumer-quota-metrics --help)
● Bash(gcloud alpha services quota create --help)
● Bash(gcloud alpha services quota create --service=run.googleapis.com --consumer=projects/aisprint-4...) (ctrl+o to
expand)

  The request to set a consumer override of 3 for RTX Pro 6000 GPUs with zonal redundancy in us-central1 returned a
  precondition failure:&lt;span class="sb"&gt;

    ERROR: (gcloud.alpha.services.quota.create) FAILED_PRECONDITION: The consumer override value can only be set
&lt;/span&gt;  between 0 to 0.
    COMMON_QUOTA_CONSUMER_OVERRIDE_TOO_HIGH

  ### Explanation

  This error indicates that the project currently has a producer-defined maximum limit of 0 for this regional metric.
  Consequently, self-service quota increases via gcloud overrides are restricted to 0 .

  To obtain this quota, you must request a quota increase (which raises the producer limit) through the Google Cloud
  Console:
&lt;span class="p"&gt;
  1.&lt;/span&gt; Navigate to the IAM &amp;amp; Admin &amp;gt; Quotas page.
&lt;span class="p"&gt;  2.&lt;/span&gt; Filter for the metric: run.googleapis.com/nvidia_rtx_pro_6000_gpu_allocation .
&lt;span class="p"&gt;  3.&lt;/span&gt; Select the us-central1 region and click Edit Quotas to submit a formal request to Google Cloud Support.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official vLLM repository also has Gemma 4-specific information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm/releases" rel="noopener noreferrer"&gt;Releases · vllm-project/vllm&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Model Lifecycle Management via MCP
&lt;/h4&gt;

&lt;p&gt;The MCP tools provide a suite of agent-oriented operations for managing a vLLM deployment on Cloud Run with a GPU.&lt;/p&gt;

&lt;p&gt;Overview of the MCP tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; get mcp tools&lt;/span&gt;

● gpu-6000-devops-agent/get_help(MCP Help Retrieval) (ctrl+o to expand)

  Here are the MCP tools available from the gpu-6000-devops-agent server:

  ### 🐳 Infrastructure &amp;amp; Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU.
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks L4 and other GPU quotas for a region.

  ### 📊 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy model from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download model from Hugging Face and upload to GCS.
  • get_huggingfacehub_download_path : Resolves local cache path using huggingface_hub .

  ### 📊 Monitoring &amp;amp; Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  ### 📈 Performance &amp;amp; Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  ### 💬 Interaction &amp;amp; Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploy The Model
&lt;/h4&gt;

&lt;p&gt;The MCP server tools are used to deploy the model to Cloud Run with a GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deploy the model
 ### Deployment Status Summary

The Gemma 4 model has been successfully deployed to Google Cloud Run in us-central1 using the NVIDIA RTX 6000 (
  nvidia-rtx-pro-6000 ) GPU accelerator!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checking System Status
&lt;/h4&gt;

&lt;p&gt;The status can be checked with an MCP tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; get_system_status&lt;/span&gt;

● gpu-6000-devops-agent/get_system_status(System Status Retrieval) (ctrl+o to expand)

  Here is the current system status for the GPU Cloud Run deployment:

  • vLLM Health: 🟢 Online (https://gpu-6000-devops-agent-289270257791.us-central1.run.app)
  • Cloud Run Service Status: 🟢 Ready

  Next Step: You can use the query_gemma4 tool to interact with the model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cross Check The Deployed Model
&lt;/h4&gt;

&lt;p&gt;Once the model has started, the MCP tools allow for model verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; verify_model_health&lt;/span&gt;

● gpu-6000-devops-agent/verify_model_health(Verify Model Health) (ctrl+o to expand)

  The model health check passed successfully:

  • Status: ✅ PASSED
  • Model: gpu-6000-devops-agent 
  • Response: "Hello! Yes, I am working. I am a large language mo..."
  • Latency: 0.41 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; get_active_models&lt;/span&gt;

● gpu-6000-devops-agent/get_model_details(Model Details Retrieval) (ctrl+o to expand)

  Here are the active model details retrieved from the endpoint (https://gpu-6000-devops-agent-289270257791.us-        
  central1.run.app):

  ### 🧩 Active Model Info ( /v1/models )&lt;span class="sb"&gt;

    [
      {
        "id": "gpu-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

&lt;/span&gt;  ### 🩺 Health Status ( /health )

  • Status: Healthy ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
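
&lt;p&gt;The same check can be scripted against the endpoint directly. The sketch below only covers parsing the /v1/models response and leaves the HTTP request out. It is a sketch under assumptions: the function name served_model_ids is hypothetical, and it accepts both the bare list printed above and an OpenAI-style envelope with a data field, since the transcript does not show which one the server actually returned.&lt;/p&gt;

```python
import json


def served_model_ids(body):
    """Extract model ids from a /v1/models response body (JSON text)."""
    payload = json.loads(body)
    # Accept both a bare list (as printed above) and an OpenAI-style
    # envelope of the form {"object": "list", "data": [...]}.
    entries = payload.get("data", []) if isinstance(payload, dict) else payload
    return [entry["id"] for entry in entries if entry.get("object") == "model"]


# The list shown in the transcript above.
bare = '[{"id": "gpu-6000-devops-agent", "object": "model", "owned_by": "vllm"}]'
# The same entry wrapped in an OpenAI-style envelope (assumed shape).
wrapped = '{"object": "list", "data": ' + bare + "}"

print(served_model_ids(bare))
print(served_model_ids(wrapped))
```

&lt;p&gt;Comparing the returned ids against the served model name is a cheap way to confirm that the expected model is the one behind the URL.&lt;/p&gt;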



&lt;p&gt;and the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; get_endpoint&lt;/span&gt;

● gpu-6000-devops-agent/get_endpoint(Endpoint Retrieval) (ctrl+o to expand)

  The active endpoint is online at:

  🟢 https://gpu-6000-devops-agent-289270257791.us-central1.run.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; get_system_details&lt;/span&gt;

● gpu-6000-devops-agent/status_vllm(vLLM Status Retrieval) (ctrl+o to expand)

  Here are the detailed system and service status details for the gpu-6000-devops-agent deployment on Cloud Run:

  ### ⚙️ Cloud Run Service Conditions &amp;amp; Status

  • Overall Status: Ready ✅
  • Configuration Status: ConfigurationsReady (True) ✅
  • Route Status: RoutesReady (True) ✅
  • Latest Created Revision: gpu-6000-devops-agent-00002-hvw 
  • Direct URL: https://gpu-6000-devops-agent-wgcq55zbfq-uc.a.run.app
  • Last Transition Time: 2026-05-29T14:07:55Z 

────────────────────────────────────────────────────────────
&lt;span class="gt"&gt;&amp;gt; which gpu is cloud run using&lt;/span&gt;

● gpu-6000-devops-agent/get_vllm_deployment_config(Deployment Config Retrieval) (ctrl+o to expand)

  Cloud Run is configured to use the NVIDIA RTX 6000 Ada Generation GPU (configured via --gpu-type=nvidia-rtx-pro-6000 
  with
  1 GPU unit).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the Cloud Run deployment status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;
&amp;gt; show the cloud run docker status&lt;/span&gt;

● Bash(docker ps -a)
● gpu-6000-devops-agent/analyze_gpu_logs(GPU Log Analysis)
● Bash(gcloud run services describe gpu-6000-devops-agent --region us-central1 --
format="value(status.latestReadyRevisio...)
● Bash(gcloud run revisions describe gpu-6000-devops-agent-00002-hvw --region us-central1) (ctrl+o to expand)

  Here is the current Docker/Container configuration and status for the active revision gpu-6000-devops-agent-00002-  
  hvw on
  Cloud Run:

  ### 🐳 Container &amp;amp; Image Information

  • Container Name: vllm-openai-1 
  • Docker Image: mirror.gcr.io/vllm/vllm-                                                                            
  openai@sha256:69cf768308bad3a6fde6ffeffc8ba1f28433752c01e9cb53f57bfaf547cec2e1 
  • Entrypoint Command: bash 
  • Arguments:
    -c vllm serve /mnt/models/gemma-4-E4B-it &lt;span class="err"&gt;\&lt;/span&gt;
       --served-model-name gpu-6000-devops-agent &lt;span class="err"&gt;\&lt;/span&gt;
       --enable-log-requests &lt;span class="err"&gt;\&lt;/span&gt;
       --enable-chunked-prefill &lt;span class="err"&gt;\&lt;/span&gt;
       --enable-prefix-caching &lt;span class="err"&gt;\&lt;/span&gt;
       --generation-config auto &lt;span class="err"&gt;\&lt;/span&gt;
       --enable-auto-tool-choice &lt;span class="err"&gt;\&lt;/span&gt;
       --tool-call-parser gemma4 &lt;span class="err"&gt;\&lt;/span&gt;
       --reasoning-parser gemma4 &lt;span class="err"&gt;\&lt;/span&gt;
       --dtype bfloat16 &lt;span class="err"&gt;\&lt;/span&gt;
       --quantization fp8 &lt;span class="err"&gt;\&lt;/span&gt;
       --kv-cache-dtype fp8 &lt;span class="err"&gt;\&lt;/span&gt;
       --max-num-seqs 8 &lt;span class="err"&gt;\&lt;/span&gt;
       --gpu-memory-utilization 0.95 &lt;span class="err"&gt;\&lt;/span&gt;
       --tensor-parallel-size 1 &lt;span class="err"&gt;\&lt;/span&gt;
       --load-format runai_streamer &lt;span class="err"&gt;\&lt;/span&gt;
       --port 8080 &lt;span class="err"&gt;\&lt;/span&gt;
       --host 0.0.0.0 &lt;span class="err"&gt;\&lt;/span&gt;
       --max-model-len 32767

  ### 🎛 Resource Allocation

  • CPU: 20 (Always allocated)
  • Memory: 80Gi 
  • GPU: 1 unit of NVIDIA RTX 6000 ( nvidia-rtx-pro-6000 )
  • Concurrency: 16 requests per container instance
  • Scale Limits: Min instances: 1 , Max instances: 3 

  ### 💾 Volumes

  • Model GCS Bucket: aisprint-491218-bucket mounted at /mnt/models (Read-only, uid/gid: 1001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Check the Cloud Run Service
&lt;/h4&gt;

&lt;p&gt;Once deployed, the model is visible as a standard Cloud Run service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjak4molcu9ylbs3bdtol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjak4molcu9ylbs3bdtol.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Review the Model
&lt;/h4&gt;

&lt;p&gt;Check the model details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; what are the model details?&lt;/span&gt;

● gpu-6000-devops-agent/list_bucket_models(GCS Bucket Model Listing) (ctrl+o to expand)

  The model currently being served is Gemma 4 E4B-it ( gemma-4-E4B-it ).

  Here are the key details of the active model:

  ### 📁 Model Storage &amp;amp; Location

  • Source Path (in Cloud Storage Bucket): gs://aisprint-491218-bucket/gemma-4-E4B-it/ 
  • Local Container Mount Path: /mnt/models/gemma-4-E4B-it 
  • Model Size: The primary weights file ( model.safetensors ) is 15.25 GB (15,251.73 MB).

  ### ⚙️ vLLM serving configuration:

  • Quantization: fp8 
  • KV Cache Dtype: fp8 
  • Max Model Length: 32,767 tokens
  • Tensor Parallel Size: 1 (runs on a single GPU instance)
  • Serving Name (OpenAI endpoint ID): gpu-6000-devops-agent 

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
&lt;span class="gt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once everything is online, you can send queries to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; query_gemma4_with_stats what is gemma4?&lt;/span&gt;

● gpu-6000-devops-agent/query_gemma4_with_stats(Gemma 4 Stats Query) (ctrl+o to expand)

  Here are the performance stats and response from the model:

  ### 📊 Performance Stats

  • Model: gpu-6000-devops-agent 
  • Time to First Token (TTFT): 0.093s 
  • Total Generation Time: 0.570s 
  • Throughput: 161.29 tokens/s 
  • Total Tokens Generated: ~77

  ### 💬 Model Response

  │ I am Gemma 4.
  │
  │ I am a Large Language Model developed by Google DeepMind. I am an open weights model, which means my underlying
  │ architecture and weights are available for the community to use and build upon.
  │
  │ In essence, I am a sophisticated AI designed to understand and generate human-like text based on the vast amounts
  of
  │ data I was trained on.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
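
&lt;p&gt;For readers who want to reproduce these numbers, the sketch below shows one way such statistics can be derived from the arrival times of streamed tokens. It is not the tool's actual implementation: the function name stream_stats is hypothetical, and throughput is assumed to mean tokens divided by the time spent after the first token. That definition is roughly in line with the figures above, since 77 / (0.570 - 0.093) is about 161 tokens/s.&lt;/p&gt;

```python
def stream_stats(request_start, token_times):
    """Compute TTFT, total time, and decode throughput for one stream.

    request_start: time the request was sent, in seconds.
    token_times: arrival time of each streamed token, in seconds.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    decode_time = total - ttft
    # Throughput is defined here as tokens per second after the first token;
    # a single-token stream has no decode phase, so report 0.0 for it.
    throughput = len(token_times) / decode_time if decode_time else 0.0
    return {"ttft": ttft, "total": total, "tokens_per_s": throughput}


# Synthetic example: first token at 0.1 s, then 9 more, one every 0.05 s.
times = [0.1 + 0.05 * i for i in range(10)]
stats = stream_stats(0.0, times)
print({key: round(value, 3) for key, value in stats.items()})
```

&lt;p&gt;In a real client the timestamps would come from a monotonic clock read as each streamed chunk arrives.&lt;/p&gt;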



&lt;h4&gt;
  
  
  Benchmark the Deployed Model
&lt;/h4&gt;

&lt;p&gt;The deployed model was benchmarked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  The matrix benchmark sweep has completed successfully! All 96 execution cells (concurrency levels from 1 to 128 
  across context window sizes from 8 to 16,384 tokens) were swept.

  I've generated a detailed summary and findings report in benchmark_results.md.

  ### 🚀 Key Performance Highlights

  • Stability: 100% Success Rate across all concurrent loads and context sizes.
  • Peak Throughput: Reached ~350–365 tokens/sec at high concurrencies for small context windows (8–32 tokens).
  • Deep Context Scaling: Maintained ~21–25 tokens/sec at the maximum 16,384 context window size under a high load of  
  128 concurrent users.
  • Latency: The average latency stayed under 0.25 seconds even during maximum concurrency/context window testing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
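
&lt;p&gt;The report does not list the individual grid points, but the count of 96 cells is consistent with a powers-of-two sweep: 8 concurrency levels from 1 to 128 times 12 context sizes from 8 to 16,384. The sketch below rebuilds that grid. The powers-of-two spacing is an assumption that happens to reproduce the reported total, and the helper name powers_of_two is hypothetical.&lt;/p&gt;

```python
from itertools import product


def powers_of_two(low, high):
    """Return the powers of two from `low` to `high` inclusive.

    Both bounds are expected to be powers of two themselves.
    """
    return [2 ** exp for exp in range(low.bit_length() - 1, high.bit_length())]


concurrency_levels = powers_of_two(1, 128)   # 1, 2, 4, ..., 128
context_sizes = powers_of_two(8, 16384)      # 8, 16, 32, ..., 16384

# One benchmark cell per (concurrency, context size) pair.
cells = list(product(concurrency_levels, context_sizes))
print(len(concurrency_levels), len(context_sizes), len(cells))  # prints: 8 12 96
```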



&lt;p&gt;The benchmark results are visualized below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzk4gs08xxdz239ikg9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzk4gs08xxdz239ikg9k.png" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Compare L4 vs RTX 6000
&lt;/h4&gt;

&lt;p&gt;The model was previously deployed on Cloud Run with an NVIDIA L4 GPU, as described here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xbill999.medium.com/gemma-4-deployment-with-nvidia-l4-mcp-cloud-run-and-antigravity-cli-03b5a472458d" rel="noopener noreferrer"&gt;Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results were compared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; compare results to L4 deployment&lt;/span&gt;
● Search(Find benchmark data)
● Read(/home/xbill/gemma4-tips/gpu-6000-devops-agent/benchmark_report.md) (ctrl+o to expand)

▸ Thought for 2s, 241 tokens
  Analyzing GPU Performance

● Create(/home/xbill/.gemini/antigravity-cli/brain/45...-41cd-9dea-85246ed1e468/comparison_report.md) (ctrl+o to expand)

  I have analyzed and compared your new NVIDIA RTX PRO 6000 Ada GPU benchmark results against the baseline NVIDIA L4
  GPU deployment results.

  The detailed comparative analysis is saved in the comparison_report.md artifact.

  ### 📊 Performance Summary: NVIDIA L4 vs. NVIDIA RTX PRO 6000

  • Peak Throughput: Increased from 80–90 tokens/sec on L4 to 350–365 tokens/sec on the RTX PRO 6000 (~4x throughput   
  speedup).
  • Average Latency: Dropped from ~1.0 second under load on L4 to under 0.25 seconds on the RTX PRO 6000 (~4x latency  
  reduction).
  • Context Capacity: Doubled from 16,384 tokens to 32,767 tokens (made possible by the RTX 6000's memory headroom
  combined with the configured FP8 model and KV-cache quantization).
  • Long Context Throughput (16k): Remains comparable (~21–25 tokens/sec on RTX 6000 vs. ~25–27 tokens/sec on L4), but
  with much lower latency under high user concurrency.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and a deep dive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gt"&gt;&amp;gt; deep dive into L4 vs RTX PRO 6000 benchmark results&lt;/span&gt;

▸ Thought for 1s, 485 tokens
  Analyzing Benchmark Results

● Create(/home/xbill/.gemini/antigravity-cli/brain/45...1cd-9dea-85246ed1e468/benchmark_deep_dive.md) (ctrl+o to expand)

  I have compiled a detailed, systems-level benchmark comparison analyzing how the hardware specifications of the
  NVIDIA
  L4 and NVIDIA RTX PRO 6000 Blackwell translate directly into the observed benchmark behaviors.

  The complete report is available in the benchmark_deep_dive.md artifact.

  ### 🔍 Summary of Key Architectural Explanations
&lt;span class="p"&gt;
  1.&lt;/span&gt; Memory Bandwidth Bottleneck (Autoregressive Decoding):
      • Observation: The L4 baseline flattens out around 80–90 tokens/sec at concurrency levels above 16. The RTX 6000
      Blackwell scales up to 365 tokens/sec at 128 concurrent users.
      • Explanation: The autoregressive decode phase is memory-bandwidth bound. The RTX 6000 Blackwell's 960 GB/s GDDR7
      bandwidth is 3.2x faster than the L4's 300 GB/s, allowing vLLM to process larger batches without memory stalls.
&lt;span class="p"&gt;  2.&lt;/span&gt; Compute-Bound Prefill (Time to First Token &amp;amp; Latency):
      • Observation: Under high load, average latency on the L4 climbs to ~1.0 second, whereas it stays flat under 0.25
      seconds on the RTX 6000 Blackwell.
      • Explanation: Prompt processing (prefill) is compute-bound. The RTX 6000 Blackwell features 24,064 CUDA cores
      and 1,457 TFLOPS of FP8 performance (vs. 7,424 cores and 242 TFLOPS on L4), computing prompt matrix
      multiplications almost instantaneously.
&lt;span class="p"&gt;  3.&lt;/span&gt; VRAM Headroom for Deep Context (16k vs. 32k):
      • Observation: The L4 hits a hard limit at a 16,384 context window. The RTX 6000 Blackwell scales up to 32,767   
      tokens with high concurrency.
      • Explanation: Large context windows require massive VRAM pools to store KV-caches. The RTX 6000 Blackwell's 96  
      GB VRAM easily accommodates the FP8-quantized model and deep KV-caches, maintaining a stable throughput of ~24.66
      tokens/sec at the maximum 32,767 limit.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
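
&lt;p&gt;The ratios behind these explanations are easy to recompute. The short sketch below divides the RTX PRO 6000 figure by the L4 figure for each metric. The numbers are taken as quoted in the transcripts above and are not independently verified hardware specifications. The function name speedup is hypothetical.&lt;/p&gt;

```python
# Figures quoted in the transcripts above: (L4 value, RTX PRO 6000 value).
FIGURES = {
    "memory bandwidth (GB/s)": (300, 960),
    "peak throughput (tokens/s)": (90, 365),
    "FP8 compute (TFLOPS)": (242, 1457),
    "CUDA cores": (7424, 24064),
}


def speedup(l4_value, rtx_value):
    """Return how many times larger the RTX figure is than the L4 figure."""
    return rtx_value / l4_value


for name, (l4_value, rtx_value) in FIGURES.items():
    print(f"{name}: {speedup(l4_value, rtx_value):.2f}x")
```

&lt;p&gt;Putting the ratios side by side makes it easier to compare the measured throughput gain against the quoted bandwidth and compute gains.&lt;/p&gt;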



&lt;p&gt;and graphed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11tpdz8r60u56mjsrbcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11tpdz8r60u56mjsrbcp.png" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The Final Verdict? Surprise!
&lt;/h4&gt;

&lt;p&gt;A full cost analysis was done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;  Analyzing Compute Costs
  To compare the cost efficiency of the NVIDIA L4 vs. the NVIDIA RTX PRO 6000 for serving the Gemma 4 (4B-it) model, we
  break down Google Cloud Run resource pricing (Tier 1 regions) and measure it against the benchmarked throughput.
  ──────
  ### 1. Google Cloud Run Resource Rates (Tier 1 Regions)

  To utilize GPUs on Cloud Run, the CPU and memory must remain allocated during the instance lifecycle.

   Billing Component | Unit Rate (CPU Always Allocated)
  ----------------------------------------------------------|----------------------------------------------------------
   vCPU | $0.00001800 per vCPU-second
   Memory | $0.00000200 per GiB-second
   NVIDIA L4 GPU (No Zonal Redundancy) | $0.00018670 per GPU-second (~$0.67/hr)
   NVIDIA RTX PRO 6000 | $0.00079170 per GPU-second (~$2.85/hr)
  ──────
  ### 2. Minimum Stack Configurations &amp;amp; Hourly Costs

   Stack Component | 🟢 NVIDIA L4 Configuration | 🚀 NVIDIA RTX PRO 6000 Configuration
  ---------------------------------------|--------------------------------------|--------------------------------------
   GPU | 1x NVIDIA L4 ($0.672/hr) | 1x NVIDIA RTX PRO 6000 ($2.85/hr)
   vCPU Allocation | 4 vCPUs ($0.259/hr) | 20 vCPUs ($1.296/hr)
   Memory Allocation | 16 GiB ($0.115/hr) | 80 GiB ($0.576/hr)
   Total Hourly Cost | $1.047 / hour ($0.000291/sec) | $4.722 / hour ($0.001312/sec)
  ──────
  ### 3. Cost Per Token Comparison (Gemma 4 4B-it)

  Using the benchmark results for both setups, we calculate the cost per 1 Million Tokens:

  #### Scenario A: Peak Throughput (Short Context)

  • 🟢 NVIDIA L4: Peak Throughput of 90 tokens/sec
      • Time to generate 1M tokens: 11,111 seconds (~3.09 hours)
      • Cost per 1M Tokens: $3.23
  • 🚀 NVIDIA RTX PRO 6000: Peak Throughput of 365 tokens/sec
      • Time to generate 1M tokens: 2,740 seconds (~0.76 hours)
      • Cost per 1M Tokens: $3.59

  │ [!NOTE]
  │ For peak throughput workloads, the NVIDIA L4 is slightly cheaper (~10% lower cost per token), but the RTX PRO 6000
  │ delivers the tokens 4x faster with 4x lower latency.

  #### Scenario B: Deep Context Boundary (16K Context Window)

  • 🟢 NVIDIA L4: Throughput of 26 tokens/sec
      • Time to generate 1M tokens: 38,462 seconds (~10.68 hours)
      • Cost per 1M Tokens: $11.18
  • 🚀 NVIDIA RTX PRO 6000: Throughput of 23 tokens/sec
      • Time to generate 1M tokens: 43,478 seconds (~12.08 hours)
      • Cost per 1M Tokens: $57.03

  ──────
  ### 🏆 Verdict: Which has the best cost per token?
&lt;span class="p"&gt;
  1.&lt;/span&gt; Overall Winner: NVIDIA L4 GPU
  The NVIDIA L4 has the best cost per token across all scenarios. At 16K context windows, it is over 5x more
  cost-effective than the RTX PRO 6000 because the RTX PRO 6000 is bound by a minimum compute footprint of 20 vCPUs
  and 80 GiB memory, making its base running cost significantly higher ($4.72/hr vs. $1.05/hr) even when throughput
  scales down at long contexts.
&lt;span class="p"&gt;  2.&lt;/span&gt; Best for Latency/High Concurrency: RTX PRO 6000
  If sub-second latency under high concurrent load is required, the RTX PRO 6000 is only ~11% more expensive per token
  for short contexts while generating tokens 4x faster.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
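
&lt;p&gt;The arithmetic behind these figures can be reproduced with a short Python sketch. The rates, minimum configurations, and throughputs are copied from the analysis above and are not checked against current Cloud Run pricing; the function and variable names are illustrative assumptions.&lt;/p&gt;

```python
# Rates quoted in the analysis above (USD per second, Tier 1 regions).
# They are assumptions taken from that output, not verified live pricing.
VCPU_RATE = 0.000018    # per vCPU-second
MEMORY_RATE = 0.000002  # per GiB-second
GPU_RATES = {"L4": 0.0001867, "RTX PRO 6000": 0.0007917}  # per GPU-second


def stack_cost_per_second(gpu, vcpus, memory_gib):
    """Per-second cost of one instance: GPU plus allocated vCPU and memory."""
    return GPU_RATES[gpu] + vcpus * VCPU_RATE + memory_gib * MEMORY_RATE


def cost_per_million_tokens(gpu, vcpus, memory_gib, tokens_per_second):
    """Cost of generating 1M tokens at a sustained throughput."""
    seconds = 1_000_000 / tokens_per_second
    return seconds * stack_cost_per_second(gpu, vcpus, memory_gib)


# Minimum configurations and benchmarked throughputs from the analysis.
l4_peak = cost_per_million_tokens("L4", 4, 16, 90)
rtx_peak = cost_per_million_tokens("RTX PRO 6000", 20, 80, 365)
l4_16k = cost_per_million_tokens("L4", 4, 16, 26)
rtx_16k = cost_per_million_tokens("RTX PRO 6000", 20, 80, 23)

results = [
    ("L4, peak throughput", l4_peak),
    ("RTX PRO 6000, peak throughput", rtx_peak),
    ("L4, 16K context", l4_16k),
    ("RTX PRO 6000, 16K context", rtx_16k),
]
for label, cost in results:
    print(f"{label}: ${cost:.2f} per 1M tokens")
```

&lt;p&gt;Under those inputs the sketch gives about $3.23 and $3.59 per 1M tokens at peak throughput, and about $11.18 and $57.03 at the 16K context boundary, which matches the analysis. The per-second stack cost is fixed by the minimum configuration, so cost per token scales inversely with throughput.&lt;/p&gt;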



&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;A minimal stdio-transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. These tools were then used to deploy and benchmark the L4 and RTX PRO 6000 GPUs on Google Cloud Run.&lt;/p&gt;

</description>
      <category>googleantigravity</category>
      <category>vllm</category>
      <category>googlecloudrun</category>
      <category>gemma4</category>
    </item>
    <item>
      <title>The Kubernetes Overkill: Why I Built a "K8s Killer" for Small Environments</title>
      <dc:creator>Mario Ezquerro</dc:creator>
      <pubDate>Fri, 29 May 2026 09:48:30 +0000</pubDate>
      <link>https://dev.to/gde/the-kubernetes-overkill-why-i-built-a-k8s-killer-for-small-environments-m9k</link>
      <guid>https://dev.to/gde/the-kubernetes-overkill-why-i-built-a-k8s-killer-for-small-environments-m9k</guid>
      <description>&lt;p&gt;Let’s be honest: Kubernetes is the undisputed king of orchestration. But if you are running a small to medium-sized environment, deploying K8s is often like using a sledgehammer to crack a nut. It brings massive resource overhead and operational complexity that most projects simply do not need.&lt;/p&gt;

&lt;p&gt;Worse yet, despite its size, Kubernetes isn't actually a complete solution out of the box.&lt;/p&gt;

&lt;p&gt;If you want proper SRE management—like native observability, health checks, and Service Level Objectives (SLOs)—you have to bolt on a complex stack of external tools. Need Ingress management to expose your services? You'll have to configure and maintain third-party Ingress controllers just to get traffic flowing.&lt;/p&gt;

&lt;p&gt;That architectural bloat led me to advocate for a more minimalist approach. I wanted an orchestrator that provided everything necessary for robust deployments without the massive footprint.&lt;/p&gt;

&lt;p&gt;Enter Gubernator: built to be the Kubernetes killer for small environments.&lt;/p&gt;

&lt;p&gt;What Makes &lt;a href="https://mario-ezquerro.github.io/gubernator/" rel="noopener noreferrer"&gt;Gubernator&lt;/a&gt; Different?&lt;br&gt;
Gubernator (gbnt) is a lightweight, distributed container orchestrator written entirely in Go. It strips away the unnecessary layers of traditional orchestration and bakes the essential operational requirements directly into its core:&lt;/p&gt;

&lt;p&gt;Native SRE &amp;amp; Observability: Instead of treating SRE as an afterthought, Gubernator integrates robust health checks and observability (leveraging OpenTelemetry and Prometheus) directly into the orchestrator.&lt;/p&gt;

&lt;p&gt;Built-in Ingress Management: No need to wrestle with external controllers. Gubernator handles your ingress routing natively, simplifying how you expose and manage your services.&lt;/p&gt;

&lt;p&gt;True Software Minimalism: By using SQLite for data consistency rather than heavy key-value stores, Gubernator maintains a tiny resource footprint while ensuring reliable state management across multiple hosts.&lt;/p&gt;

&lt;p&gt;Developer-Friendly Interface: A clean REST API and an intuitive CLI tool get your containers running securely and efficiently.&lt;/p&gt;

&lt;p&gt;If you are tired of the Kubernetes learning curve and the overhead of maintaining tools you barely use, it's time to look at alternatives that prioritize simplicity and reliability.&lt;/p&gt;

&lt;p&gt;Check out the repository, documentation, and architecture on the Gubernator GitHub Pages site:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mario-ezquerro.github.io/gubernator/" rel="noopener noreferrer"&gt;https://mario-ezquerro.github.io/gubernator/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’d love to hear from the community: at what point do you think Kubernetes becomes too big for its own good? Drop your thoughts in the comments!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
