DEV Community: Google Developer Experts

[Gemini][Agent] Google Managed Agents API

Evan Lin — Wed, 03 Jun 2026 01:01:36 +0000

(Image Source: Google Cloud Docs - Managed Agents on Agent Platform)

Preamble: The era of hand-rolling your own agent loop is coming to an end

In the past, if you wanted to build an AI agent that could truly " do things ", the component list that came to mind probably looked something like this:

An LLM main loop (ReAct? Write your own state machine?)
A sandbox to run LLM-generated code (Docker? Firecracker? E2B?)
A filesystem to store intermediate files produced by the agent (S3? Local? Temporary or persistent?)
A search API (Connect to Google Custom Search yourself? SerpAPI?)
A page fetcher (playwright? readability-lxml?)
A tool router to connect all of the above
And only then, how to let the user continue the session

And once the session broke, the report.md, sources.json that the agent was halfway through writing, and the venv that was halfway running, would all be gone. Nobody wants to do "I'll open a Docker for you, mount a volume, and remember to delete it in 7 days" again.

These past few days, Google has turned this pipeline into " calling a managed API " in Cloud Docs — Gemini Enterprise Agent Platform launched the Managed Agents API (internal codename Antigravity), which manages the sandbox, filesystem, and toolset entirely. Just pass an environment ID, and the agent's intermediate files from last time will still be waiting for you.

This article will do two things:

Break down the core capabilities clearly, including what the underlying antigravity-preview-05-2026 model is doing.
Use an open-source LINE Research Planner Bot (kkdai/line-research-bot) as a live demonstration to see how new features are combined in actual production code — and share the five typical Pre-GA pitfalls I encountered during debugging to help you avoid them.

Three Key Core Capabilities

According to the official documentation, the core of Managed Agents revolves around three things:

1. Persistent Sandbox + Filesystem

In the past, code interpreter-like functions would restart a container with each call, losing all previously pip installed packages, written files, and half-open Python interpreters.

“Each agent operates within a sandboxed environment … capable of reasoning, planning, executing code, web searching, and file operations.”

Now, if you make a second interaction with the same environment_id, the agent will see the /workspace/ from the previous session:

/workspace/sources.json is still there
/workspace/report.md was half-written, this time it continues to modify it
Packages like markdown installed with pip install last time don't need to be reinstalled

For us product builders, this means:

No need to maintain your own sandbox infrastructure (Firecracker, microVM, expiration cleanup).
Agents can truly "complete a big task in multiple turns", instead of starting over each turn.
A TTL of 7 days, during which any interaction automatically refreshes, meaning it stays alive as long as the user uses it once a week.

My LINE Bot relies on this for " progressive deepening ": the user first says "research X" → the agent writes sources and a report in the sandbox; a few minutes later, the user says "Chapter 2, go deeper" → the agent reads back the original file, modifies Chapter 2, and rewrites it, all within the same sandbox and the same markdown file.

2. Built-in Tools

When building an agent, you just list the tools you want, without having to connect to APIs yourself:

tools=[
    {"type": "code_execution"}, # Python / bash / persistent venv
    {"type": "filesystem"}, # Read/write /workspace
    {"type": "google_search"}, # Real Google Search, not Custom Search
    {"type": "url_context"}, # Feed URL to automatically fetch content + extract
    {"type": "mcp_server", # Any plug-in MCP server
     "name": "grep-search",
     "url": "https://mcp.grep.app"},
]

Several key observations:

google_search is real Google, not the basic version that requires you to customize a search engine ID + API key. The return format includes search suggestions and can be used for grounding.
url_context is equivalent to free readability + content extraction, feed a URL and get the main text. No need to maintain another playwright fleet.
Native MCP support: You can directly integrate any Model Context Protocol server. The entire ecosystem is open.

3. Multi-turn Session Chaining

Each interaction returns an id. When calling the next turn, pass it as previous_interaction_id, and the agent will see the entire conversation history + sandbox state:

r1 = client.interactions.create(
    agent="research-planner",
    input="PLAN ...",
    environment={"type": "remote"}, # Open a new sandbox
    background=True,
)
# … poll until completed …

r2 = client.interactions.create(
    agent="research-planner",
    input="SEARCH_COMPARE", # No need to restate context
    environment=r1.environment_id, # Reuse sandbox
    previous_interaction_id=r1.id, # Connect history
    background=True,
)

This design turns your backend into " only responsible for deciding what prompt to send each turn ". Session state, conversation history, and file system are all server-side managed.

Two APIs: Agents for Control Plane, Interactions for Data Plane

The documentation divides into two APIs, with clear responsibilities:

API	Path	What it does
Agents API	`/projects/.../agents`	Create, update, delete agent settings (base_agent, tools, system_instruction)
Interactions API	`/projects/.../interactions:create`	Interact with deployed agents

Simply put: Agents = Configuration, Interactions = Execution. Creating an agent is a one-time task; running interactions is done every time a user message comes in. My LINE Bot only used the Agents API once during deployment to create the agent, and after that, Cloud Run only calls the Interactions API.

The underlying base model is hardcoded as antigravity-preview-05-2026, which is an agent-optimized version of the Gemini series (only this one is available during the Pre-GA preview period).

What Developers Truly Care About: Cost and Integration Cost

This API is still in Pre-GA, and the official documentation emphasizes:

“Antigravity is offered as Pre-General Availability software, which means it is not subject to any SLA or deprecation policy. Antigravity is not intended for production use or for use with sensitive data.”

In plain language:

Cannot be used for production sensitive data (for compliance scenarios, please wait for GA).
No SLA, the API shape might change someday.
Might be discontinued someday, don't bet your company's life on it.
Billing is at standard Vertex AI rates, with no additional sandbox runtime fees — this is super friendly for demos / internal tools / hackathons.

It's a very suitable entry point for personal side projects and POCs — you don't need to spend a month setting up sandbox infra yourself to build an agent that can get things done. But don't throw enterprise customer data into it.

Standard Workflow: 4 SDK Calls to Complete an Agent Interaction

The minimum viable flow after organizing the official colab (intro_managed_agents_python.ipynb):

from google import genai

# 1. Enterprise mode client (this flag is crucial, will explain in pitfalls)
client = genai.Client(enterprise=True, project="my-project", location="global")

# 2. Create agent (one-time, reusable)
agent = client.agents.create(
    id="research-planner",
    base_agent="antigravity-preview-05-2026",
    description="Multi-stage research agent",
    system_instruction="You are a research planner. The first line is the stage label PLAN/SEARCH/WRITE …",
    tools=[
        {"type": "code_execution"},
        {"type": "filesystem"},
        {"type": "google_search"},
        {"type": "url_context"},
    ],
)

# 3. First interaction, open a new sandbox
r1 = client.interactions.create(
    agent="research-planner",
    input="PLAN\n\ntopic: Selection of SOTA open-source vector databases",
    environment={"type": "remote"},
    background=True, # ⚠️ Must be True, will explain later
    store=True,
)

# 4. Continue with the same environment
r2 = client.interactions.create(
    agent="research-planner",
    input="SEARCH_COMPARE",
    environment=r1.environment_id,
    previous_interaction_id=r1.id, # Connect history
    background=True,
    store=True,
)

# poll for results
import time
while True:
    polled = client.interactions.get(r2.id)
    if polled.status == "completed":
        print(polled.output_text)
        break
    time.sleep(2)

No exaggeration, a multi-stage agent from scratch is less than 30 lines of code. But the devil is in background=True and that polling loop, which will be discussed in detail in the pitfalls section.

Demo Case: LINE Research Planner Bot

SDK examples alone are too abstract, so I built it into a working LINE Bot, open-sourced at kkdai/line-research-bot:

The user sends a research topic in the LINE chat box (e.g., "Research on the selection of SOTA open-source vector databases").
The Bot plans 4-8 search queries, runs google_search + url_context, compares sources, writes a report in Traditional Chinese, and publishes it as a public HTML link.
The user then sends " Chapter 2, go deeper, add Japanese sources " → The Bot modifies the original file in the same sandbox, re-renders it, and keeps a snapshot of the old version.
Deployment targets: GCP Cloud Run + Firestore + GCS + Cloud Tasks.

The architecture is very straightforward:

Component	Role
LINE Webhook	FastAPI receives message events
Firestore	`line_bot_users / line_bot_reports` persistence
Cloud Tasks	Pushes long-running tasks from webhook to background worker (avoids LINE reply token 60-second limit)
Managed Agent	Planning + Search comparison + Writing ( three-stage chain)
Cloud Run worker	Renders markdown → HTML → Uploads to GCS ( Why not in the sandbox? Pitfall 2 will explain )
GCS Bucket	Public HTML hosting

Comparing with the three core capabilities mentioned earlier:

Persistent Sandbox: The three stages PLAN → SEARCH_COMPARE → WRITE_REPORT are chained within the same environment_id, and sources.json written once can be read by all three stages.
Built-in Tools: The SEARCH_COMPARE stage uses google_search + url_context. The agent decides what to search, which pages to read, and how to summarize.
Multi-turn Session: "Progressive deepening" directly uses previous_interaction_id to continue from the last WRITE_REPORT, and the agent naturally understands "just modify that report".

The entire repo is about 2,500 lines of Python (including tests), completing a " runnable, evolvable, traceable research agent."

Deployment Practice: Commit → Go Live Automatically

It's not enough for the open-source example to just run; this time, the entire GCP infrastructure and CI/CD are integrated.

I only provided the project ID + LINE secret, and it handled the rest end-to-end:

# Enable 6 APIs
gcloud services enable aiplatform.googleapis.com run.googleapis.com \
    cloudtasks.googleapis.com firestore.googleapis.com \
    storage.googleapis.com secretmanager.googleapis.com

# Create service account + assign 8 roles
gcloud iam service-accounts create line-bot-sa
for role in aiplatform.user datastore.user cloudtasks.enqueuer \
            storage.objectAdmin secretmanager.secretAccessor \
            iam.serviceAccountTokenCreator run.invoker logging.logWriter; do
  gcloud projects add-iam-policy-binding line-vertex \
      --member="serviceAccount:line-bot-sa@line-vertex.iam.gserviceaccount.com" \
      --role="roles/$role" --condition=None
done

# Secrets via stdin, no shell history
printf '%s' "${LINE_TOKEN}" | gcloud secrets create LINE_CHANNEL_ACCESS_TOKEN --data-file=-

# Create Agent (one-time)
curl -sS -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @agent-body.json \
    "https://aiplatform.googleapis.com/v1beta1/projects/line-vertex/locations/global/agents"

# Deploy Cloud Run
gcloud run deploy line-research-bot --source=. --timeout=3600 --memory=2Gi ...

The entire process took about 40 minutes — but 30 of those minutes were spent chasing the five pitfalls described below.

Pitfall Log: Five Pre-GA-Specific Issues

Pitfall One: Synchronous Calls → Mysterious `RESOURCE_PROJECT_INVALID`

The first time I followed the doc and directly POSTed interactions:create via REST, it returned this:

{
  "error": {
    "code": 400,
    "message": "Invalid resource field value in the request.",
    "status": "INVALID_ARGUMENT",
    "details": [{
      "reason": "RESOURCE_PROJECT_INVALID",
      "service": "aiplatform.googleapis.com"
    }]
  }
}

I spent a full hour and a half wondering:

Project not allowlisted? (Couldn't find where to apply)
Use project number or ID? (Tried both, both wrong)
Change region? (All wrong)
Change agent? (All wrong)
Even gemini-2.0-flash:generateContent returned RESOURCE_PROJECT_INVALID!

Until I carefully read the official colab and saw a line:

client = genai.Client(enterprise=True, project=..., location=...)

It differed from the genai.Client() we used by one enterprise=True. Then I ran the colab code and saw:

stream = client.interactions.create(
    ...,
    stream=False, background=True, store=True,
)

background=True.

I brought this back to REST: wrote SDK + background=True, and it immediately worked:

{"error": {"code": 500, "message": "Chiliagon path must set background to true."}}

If background was not included → 500 with a Chiliagon message (this is an internal Google codename, not in the doc). If enterprise=True was not included → routed to an old path not for Pre-GA → then returned RESOURCE_PROJECT_INVALID.

Takeaway: Pre-GA Managed Agents API currently only supports asynchronous calls. Actual usage requires:

Using the google-genai SDK with enterprise=True
interactions.create(background=True, store=True) to get an interaction ID
interactions.get(id) polling until status == "completed"

Don't waste an hour stubbornly trying raw REST like I did.

Pitfall Two: `gsutil` in the Sandbox is a Mock (This one is the most insidious)

My LINE Bot was originally designed for the agent to upload HTML to GCS itself:

gsutil -h "Cache-Control:no-cache, max-age=0" cp /workspace/report.html \
    gs://research-line/{report_id}/index.html
curl -sI https://storage.googleapis.com/research-line/{report_id}/index.html

The agent finished happily and returned:

{
  "report_id": "d4302f31...",
  "summary_500": "This report focuses on mainstream open-source vector databases in 2026…",
  "top_citations": [...],
  "new_version": 1
}

LINE received the Flex card, clicked the button → 404 NoSuchKey. GCS was empty.

I ran a diagnostic interaction to query the sandbox:

resp = client.interactions.create(
    agent="research-planner",
    input=(
        "Run these and report verbatim:\n"
        "1. echo 'X' > /tmp/diag.html\n"
        "2. gcloud auth list 2>&1\n"
        "3. gsutil cp /tmp/diag.html gs://research-line/probe.html 2>&1\n"
        "4. curl -sI https://storage.googleapis.com/research-line/probe.html\n"
        "5. gsutil ls gs://research-line/ 2>&1\n"
        "Reply ONLY with: {\"step1\":\"...\", ...}"
    ),
    environment=ENV_ID,
    background=True, store=True,
)

The returned JSON made me jump out of my chair:

{
  "step2": "No credentialed accounts.\n\nTo login, run:\n $ gcloud auth login...",
  "step3": "Mock gsutil: simulated copy to cp /tmp/diag.html gs://research-line/...",
  "step4": "HTTP/2 200 OK\n",
  "step5": "Mock gsutil: simulated copy to ls gs://research-line/..."
}

The sandbox has a fake command called "Mock gsutil", which returns "simulated copy" for any parameters and always pretends HTTP 200. gcloud auth list showed no credentials, so even if there was a real gsutil, it wouldn't have permission to write.

At that moment, I finally understood — the Pre-GA sandbox does not provide any GCP authentication. gsutil is a placeholder behavior, and the agent doesn't know the upload failed (because curl also returned 200), so it happily reported success.

Solution: Completely refactor the architecture. The agent no longer attempts to upload; instead, the agent returns the complete markdown via the report_md field:

# New system_instruction (excerpt)
"""
After writing /workspace/report.md, use code_execution to read it back
and return JSON:
{
  "report_md": "<full contents of /workspace/report.md>",
  "summary_500": "...",
  ...
}
DO NOT run gsutil. DO NOT run curl on storage.googleapis.com.
The host service handles publishing.
"""

Then the Cloud Run worker, using a service account with real IAM, takes over:

# app/publisher.py
import markdown
from google.cloud import storage

class GcsPublisher:
    def __init__ (self, *, bucket_name: str):
        self._bucket = storage.Client().bucket(bucket_name)

    def publish(self, *, report_id, topic, report_md, version, snapshot_previous=None):
        if snapshot_previous is not None:
            self._snapshot(report_id, snapshot_previous)
        body = markdown.markdown(report_md, extensions=["fenced_code", "tables", "footnotes"])
        html = _wrap_with_css(topic, body, version)
        blob = self._bucket.blob(f"{report_id}/index.html")
        blob.cache_control = "no-cache, max-age=0"
        blob.upload_from_string(html, content_type="text/html; charset=utf-8")
        return f"https://storage.googleapis.com/{self._bucket.name}/{report_id}/index.html"

Clear division of responsibilities: the agent is responsible for thinking + writing; Cloud Run is responsible for infra.

Takeaway: Do not assume the Pre-GA sandbox can access your GCP resources. For anything that needs to write to external systems, let the host service do it with a real SA, and the agent only returns the payload. By the way, from the forum, it seems that after GA, the sandbox might provide ambient credentials, but not in Pre-GA.

Pitfall Three: Cloud Run's `/healthz` is Intercepted by Google Frontend

I wrote a /healthz for Cloud Run health checks:

@app.get("/healthz")
async def healthz() -> dict:
    return {"status": "ok"}

After deployment, I called:

curl https://line-research-bot-xxx.run.app/healthz

It returned this:

<!DOCTYPE html>
<title>Error 404 (Not Found)!!1</title>
<p><b>404.</b> The requested URL /healthz was not found on this server.

It was Google Frontend's 404 page, not FastAPI's. But /docs, /webhook, /openapi.json all worked. OpenAPI also listed the GET /healthz route.

/healthz is a special reserved path in Cloud Run; Google Frontend intercepts it before the path even reaches the container.

Solution: Rename it to /readyz. Solved in one second.

@app.get("/readyz") # /healthz was intercepted, renamed
async def readyz() -> dict:
    return {"status": "ok"}

Pitfall Four: Service Account Needs to `actAs` Itself for Cloud Tasks OIDC to Sign

When pushing tasks from the webhook to Cloud Tasks, the task kept dispatching 0 times + dispatchDeadline expired. Cloud Run logs showed:

PERMISSION_DENIED: The principal lacks IAM permission "iam.serviceAccounts.actAs"
for the resource "line-bot-sa@line-vertex.iam.gserviceaccount.com"

I thought giving the SA iam.serviceAccountTokenCreator was enough, right? Not enough. Cloud Tasks needs to sign an OIDC token for the callback, which requires the SA to have actAs permission for " itself ":


shell
gcloud iam service-accounts add-iam-policy-binding \
    line

Gubernator [the kill ku8s]

Mario Ezquerro — Tue, 02 Jun 2026 12:55:56 +0000

Why I built an alternative to Kubernetes overkill.

Let’s talk about infrastructure efficiency.

Kubernetes is the undisputed industry standard for container orchestration, and for massive, complex enterprises, it’s irreplaceable. But for small-to-medium deployments or distributed multi-host edge environments, it often feels like an operational nightmare.

You install a massive orchestration layer, only to realize it's still missing the basics for actual operations. To get it production-ready, you have to layer on external tools for Ingress routing, plus a heavy stack for observability, health metrics, and SLOs. Suddenly, your infrastructure consumes more resources than your actual applications.

I believe in radical software minimalism. That’s why I’ve been developing Gubernator (gbnt), an open-source, lightweight distributed container orchestrator written entirely in Go.

Unlike traditional platforms, Gubernator bakes essential Site Reliability Engineering (SRE) and traffic management right into its core architecture:

Native Reverse Proxy: No complex third-party Ingress controllers. Gubernator automatically manages routing—like dynamically hooking up web containers to an Ingress layer (e.g., Caddy)—making exposure seamless and native.

Built-in SRE & Observability: Out-of-the-box support for health monitoring, metrics, and SLO tracking natively leveraging OpenTelemetry and Prometheus.

Zero-Bloat State Management: It swaps out heavy external key-value stores for an embedded, rock-solid SQLite architecture, ensuring multi-host consistency with a near-zero footprint.

Gubernator is designed for engineers who want robust, predictable, and resilient orchestration without the overhead and cognitive load of K8s.

If you are passionate about minimalist backend architecture, systems engineering, or streamlined DevOps, I’d love for you to take a look at the blueprint and documentation:

Explore the project: https://mario-ezquerro.github.io/gubernator/

How do you handle container orchestration for smaller setups? Let’s connect and discuss in the comments!

SoftwareEngineering #GoLang #SRE #DevOps #ContainerOrchestration #OpenSource #SystemsArchitecture #Minimalism

I Built an Autonomous AI Agent with Google ADK + Gemini That Spots Trends and Drafts Dev.to Articles for Me

Aryan Irani — Tue, 02 Jun 2026 09:43:33 +0000

Keeping up with trending technical topics and new tools on developer forums can be time-consuming. To save time, I wanted to automate the process of finding popular articles, reading the comments to understand community sentiment, and drafting a summary.

While I could write a standard Python script to scrape the dev.to API, simple scripts tend to be brittle. If an article doesn't have comments yet, a basic script will likely crash unless you write extensive error-handling logic.

Instead of a rigid script, I built an Agent—a program that can dynamically reason about errors and adjust its approach. If one task fails, it can figure out the next best step.

In this tutorial, I'll show you how to build a Trend-Spotting Agent using Python, the Google Agent Development Kit (ADK), and Gemini 2.5 Flash.

What We're Building

We are going to write a Python application that acts as an autonomous agent. We'll give it three abilities:

Search the dev.to API for rising technical articles based on specific tags.
Dynamically fetch the top comments of those articles to read real community sentiment.
Automatically draft a newsletter-style article on your DEV.to account summarizing its findings.

Prerequisites

Python 3.9+ installed on your machine.
Google ADK. (Check out the Google ADK Docs if you need help installing).
A DEV API Key. Grab this from your DEV.to account settings under "Extensions" and throw it in a .env file.

Step 1: Giving the Agent its "Hands" (API Tools)

Large Language Models (LLMs) are incredibly smart, but out of the box, they can't actually do anything on your computer.

The coolest part about Google ADK is that we can write standard Python functions, hand them to the LLM as "tools", and let the AI decide how and when to use them.

Let's write our API functions.

Tool 1: Finding Rising Articles

Here is our function to fetch rising articles. Pay close attention to the docstring ("""Fetches the top..."""). We aren't writing this for other developers; the ADK actually passes this docstring directly to the LLM so it understands exactly what the tool does.

import os
import requests

DEV_API_KEY = os.getenv("DEV_API_KEY")

def get_rising_articles(tag: str, limit: int = 5) -> str:
    """
    Fetches the top rising articles for a specific tag on DEV.to.
    Returns a formatted string of the articles with their ID, title, and URL.
    """
    url = f"https://dev.to/api/articles?tag={tag}&state=rising&per_page={limit}"
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        articles = response.json()
        summary = f"Top {limit} rising articles for '{tag}':\n"
        for i, article in enumerate(articles):
            summary += f"{i+1}. {article['title']} (ID: {article['id']})\n"
        return summary
    else:
        return f"Error fetching articles: {response.status_code}"

Tool 2: Fetching Community Comments

We don't just want our agent to read article titles; we want it to know what the community thinks. This function fetches the comments for a given article.

def fetch_comments(article_id: int) -> str:
    """
    Fetches the top comments for a specific DEV.to article by ID to gauge community sentiment.
    """
    url = f"https://dev.to/api/comments?a_id={article_id}"
    response = requests.get(url, timeout=10)
    comments = response.json()

    if not comments:
        return "No comments found on this article."

    summary_blocks = []
    # Grab up to the top 5 comments
    for i, c in enumerate(comments[:5]):
        user = c.get('user', {}).get('name', 'Anonymous')
        body = c.get('body_html', '').replace('<p>', '').replace('</p>', '').strip()
        summary_blocks.append(f"Comment {i+1} by {user}: {body}")

    return "Top Comments:\n" + "\n".join(summary_blocks)

Tool 3: The Ghostwriter

Finally, we need a way for the agent to report back. This function takes the agent's research and POSTs it directly to your Dev.to dashboard as an unpublished draft.

def create_article_draft(title: "str, body_markdown: str, tags: list[str]) -> str:"
    """
    Creates a new, unpublished article Draft on the user's DEV.to account.
    """
    url = "https://dev.to/api/articles"
    headers = {
        "api-key": DEV_API_KEY,
        "Content-Type": "application/json"
    }
    payload = {
        "article": {
            "title": title,
            "published": False,  # Keeps it as a safe draft!
            "body_markdown": body_markdown,
            "tags": tags[:4]
        }
    }

    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 201:
        data = response.json()
        return f"Successfully created draft! URL: {data.get('url')}"
    return "Failed to create draft."

Step 2: Wiring up the Agent

With our tools ready, we just need to initialize the Google ADK Agent. We give it a persona, drop in our functions, and let it go to work.

from google.adk.agents import Agent
from google.genai import types

agent_instruction = """
You are a Trend-Spotting Research Agent.
Every morning, or when requested, you monitor specific tags on DEV (Forem) to find the most interesting emerging technologies or topics.

Your workflow:
1. Use the 'get_rising_articles' tool to fetch the top rising articles.
2. Use the 'fetch_comments' tool on the most popular articles to evaluate the community's sentiment.
3. Synthesize the articles and the sentiment into an insightful trend report.
4. Use the 'create_article_draft' tool to automatically generate an unpublished DEV.to article containing your full synthesized report.
"""

root_agent = Agent(
    model="gemini-2.5-flash",
    name="trend_spotting_agent",
    description="Spots rising trends on DEV community and automatically drafts digest articles.",
    instruction=agent_instruction,
    # Here is where we hand over our Python functions!
    tools=[get_rising_articles, fetch_comments, create_article_draft],

    # Optional: configure automatic HTTP retries just in case the API drops
    generate_content_config=types.GenerateContentConfig(
        http_options=types.HttpOptions(
            retry_options=types.HttpRetryOptions(initial_delay=10, attempts=3)
        )
    )
)

And that is literally all the setup we need.

The "Aha!" Moment

To test this out, I spun up the ADK interface in my terminal (adk web) and gave the agent a fairly complex prompt:

"I need a complete trend analysis for the 'machinelearning' tag. Fetch the top 5 rising articles. Read the comments on the top article to gauge what the community actually thinks. Synthesize this research and draft a new DEV.to article for me titled 'The Community Pulse'."

I watched the terminal logs, and this is where the power of an Agentic workflow really clicked for me.

The agent easily retrieved the list of articles. It then grabbed the ID of the #1 article and called our fetch_comments(id) tool.

But there was a problem. Because it was a rapidly rising article, nobody had actually commented on it yet.

If this was a traditional scripted loop, the program would have just returned an empty array, moved on to the writing phase, and published a broken, empty newsletter.

But because this is an Agent, it analyzed the tool's response ("No comments found on this article.") and realized it couldn't complete my request. Without any prompting from me, the logs showed the agent dynamically pivoting:

"Okay, no comments on that one. Let's try the next most interesting one, 'Chapter 1: The Value Class'."
(Calls tool)
"Okay, no comments there either. Let's try 'Building a GPT From Scratch'."
(Calls tool)
"Still no comments. Last try, let's check 'Prove You're 18 Without Showing Who You Are'."

It iterated through the list, dynamically feeding the output of our tool back into its own reasoning loop until it successfully found a lively debate! It then analyzed the sentiment and proceeded to the final step.

With the research complete, the agent executed its final tool: create_article_draft().

When I opened my DEV.to dashboard, I had a fully formatted Markdown article sitting as an unpublished draft, ready for me to review.

Final Thoughts

Connecting normal REST APIs to the Google ADK framework completely changes how you think about automation. You don't need to write exhaustive edge-case logic or endless try/except blocks anymore.

You just build reliable, single-purpose tools, hand them over, and let the LLM figure out how to navigate the road bumps.

If you want to see the full source code for this project, you can check it out on my GitHub here: Link to Repo.

Let me know in the comments if you've built any interesting agents lately, and I'll see you in the next one!

31B — Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI

xbill — Mon, 01 Jun 2026 22:12:15 +0000

This article provides a step by step deployment review for Gemma 4 to a Google Cloud Run hosted GPU enabled system,. A suite of Python MCP tools is built to simplify management of the vLLM hosted Gemma 4 deployment with Antigravity CLI.

What is this project trying to Do?

This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.

This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

gpu-31B-L4-devops-agent
source init.sh

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

gpu-31B-L4-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The simplest transport that the SDK supports is the stdio (stdio/stdout) transport — which connects a locally running process. Both the MCP client and MCP Server must be running in the same environment.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/gpu-31B-L4-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

xbill@cloudshell:~/gemma4-tips/gpu-31B-L4-devops-agent (aisprint-491218)$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .

And a test run:

xbill@cloudshell:~/gemma4-tips/gpu-31B-L4-devops-agent (aisprint-491218)$ make test
python test_agent.py
2026-06-01 00:58:44,717 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
..2026-06-01 00:58:45,306 - asyncio - WARNING - Executing <Task pending name='Task-11' coro=<TestDevOpsAgent.test_deploy_vllm_hf() running at /usr/lib/python3.12/unittest/mock.py:1407> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[_run_until_complete_cb() at /usr/lib/python3.12/asyncio/base_events.py:182] created at /usr/lib/python3.12/asyncio/runners.py:100> took 0.521 seconds
.......2026-06-01 00:58:45,325 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-06-01 00:58:45,325 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-06-01 00:58:45,329 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-06-01 00:58:45,329 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.578s

OK

MCP stdio Transport

One of the key features that the MCP protocol provides is abstracting various transport methods.

The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-31B-L4-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-east4",
        "VLLM_BASE_URL": "https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-31B-it"
      }
    }
  }
}

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

Getting Started with Gemma 4 on GPU

The Official vLLM repo also has Gemma4 specific information:

Releases · vllm-project/vllm

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> ✓ gpu-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                       analyze_cloud_logging, +21 more

● gpu-devops-agent/get_help(Call get_help tool) (ctrl+o to expand)

  The active configuration and available DevOps/SRE tools are outlined below.

  ### ⚙️ Active Environment Configuration

  • GCP Project: aisprint-491218 
  • Region: us-east4 
  • Model: /mnt/models/gemma-4-31B-it (quantized at 4-bit, serving on NVIDIA L4)
  • Active Service Endpoint: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app 

  ### 🧰 Summary of Available Toolsets

  1. Deployment & Scaling: deploy_vllm , destroy_vllm , status_vllm , update_vllm_scaling ,       
  get_vllm_deployment_config , check_gpu_quotas 
  2. Diagnostics & SRE: analyze_gpu_logs , suggest_sre_remediation , verify_model_health ,         
  get_system_status 
  3. Benchmarking: run_benchmark 
  4. Model Interrogation: query_gemma4 , query_gemma4_with_stats , get_model_details

Deploy The Model

The MCP server tools are used to deploy the model to Cloud Run with TPU:

deploy the model

Checking System status

The status can be checked with an MCP tool:

● gpu-devops-agent/get_system_status(Call system status tool) (ctrl+o to expand)

  I have successfully retrieved the system status using the get_system_status tool.

  ### 🌀 GPU Cloud Run System Status

  • vLLM Health: 🟢 Online (https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app)
  • Cloud Run Service Status: 🟢 Ready

  Next Step: You can use the query_gemma4 tool to interact with the model.

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> verify_model_health

● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/verify_model_health.json)
● gpu-devops-agent/verify_model_health(Verify model health) (ctrl+o to expand)

  I have verified the model's health using the verify_model_health tool.

  ### 🏥 Model Health Check Results

  • Status: ✅ PASSED
  • Model Path: /mnt/models/nvidia/Gemma-4-31B-IT-NVFP4 
  • Response Check: "Yes, I am working! How can I help you today?..." 
  • Latency: 25.92 seconds

and model details:

> get_active_models

● Read(/home/xbill/gemma4-tips/gpu-31B-L4-devops-agent/server.py)
● Search(Grep tool decorators)
● Search(Grep def statements)
● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/get_model_details.json)
● gpu-devops-agent/get_model_details(Get active model details) (ctrl+o to expand)

  I have retrieved the active models and model details using the get_model_details tool.

  ### 🧩 Active Model Details

  • Service Endpoint: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app
  • Health Status ( /health ): Healthy ✅
  • Active Model:
    [
      {
        "id": "/mnt/models/nvidia/Gemma-4-31B-IT-NVFP4",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

and the endpoint:

> get_endpoint
● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/get_endpoint.json)
● gpu-devops-agent/get_endpoint(Get Cloud Run endpoint) (ctrl+o to expand)

  I have retrieved the active service endpoint using the get_endpoint tool.

  ### 🌐 Cloud Run vLLM Endpoint

  • Status: 🟢 Online
  • Service URL: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app

and status:

> get_system_details

● Read(/home/xbill/.gemini/antigravity-cli/mcp/gpu-devops-agent/status_vllm.json)
● Read(/home/xbill/gemma4-tips/gpu-31B-L4-devops-agent/server.py)
● gpu-devops-agent/status_vllm(Call status_vllm tool) (ctrl+o to expand)

  I have retrieved the system status details using the status_vllm tool.

  ### ⚙️ Cloud Run System Details ( gpu-31b-l4-devops-agent )

  • Service URL: https://gpu-31b-l4-devops-agent-wgcq55zbfq-uk.a.run.app
  • Latest Created Revision: gpu-31b-l4-devops-agent-00003-qjn 
  • Conditions:
      • Ready: True (Last Transition: 2026-06-01T01:35:59Z )
      • ConfigurationsReady: True (Last Transition: 2026-06-01T01:21:51Z )
      • RoutesReady: True (Last Transition: 2026-06-01T01:35:59Z )

Check the Cloud Run Service

Once Deployed — the model is visible as a standard Cloud Run Service:

Review the Model

Check the model details:

> what are the model details?

  The active model is running on vLLM with the following details:

  • Model ID / Path: /mnt/models/nvidia/Gemma-4-31B-IT-NVFP4 (This is the Gemma 4 31B quantized model using
  NVFP4 format, optimized for serving on a single NVIDIA L4 GPU).
  • Owned By: vllm 
  • Health Status: Healthy ✅
  • Active Service Endpoint: https://gpu-31b-l4-devops-agent-289270257791.us-east4.run.app

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is 2 + 2

Benchmark the Local Model

The local model was benchmarked:

The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

The analysis is reviewed:

  I have executed the concurrency and context window sweep on the deployed serving stack. The results, detailing
  hardware limits and bottleneck diagnostics, have been compiled.

  Please refer to the following generated files:

  • 📊 Read the complete report: benchmark_report.md
  • 📈 View the raw sweep data: matrix_benchmark_results.csv
  • ⚙️ Inspect the test script: matrix_benchmark.py

  ### Key Decision Point

  To scale to your target configuration of 256 concurrent users and a 16,384 token context window, the single NVIDIA L4
  GPU is physically constrained by VRAM (24GB).

This is visualized:

Comparison with 6000 GPU

  The comparison between Gemma 4 (26B) (running fully in VRAM) and Gemma 4 (31B) (running with CPU offloading) on a
  single NVIDIA L4 GPU (24GB VRAM) highlights a massive performance gap.

  Here is the comparison visualization:

  🖼 Image: Gemma 4 26B vs 31B Performance Comparison → /home/xbill/.gemini/antigravity-cli/brain/3230df23-fa96-485f-
  afb5-
  4e322b446584/comparison_chart.png

  ### 📊 Performance Metrics Comparison

   Concurrency | 26B Throughput (VRAM) | 31B Throughput (Of… | 26B Latency (… | 31B Latency (Of… | Performance Gap
  -------------|-----------------------|---------------------|----------------|------------------|---------------------
   1 | 39.02 tok/s | 0.31 tok/s | 1.99s | 51.57s | 31B is ~125x slower
   2 | 67.20 tok/s | 0.69 tok/s | 3.43s | 45.55s | 31B is ~97x slower
   4 | 126.08 tok/s | 1.38 tok/s | 3.23s | 45.78s | 31B is ~91x slower
   8 | 115.32 tok/s | 1.38 tok/s | 4.83s | 91.56s | 31B is ~83x slower
   16 | 131.11 tok/s | 1.38 tok/s | 5.95s | 183.13s | 31B is ~95x slower
   32 | 114.58 tok/s | 1.38 tok/s | 9.56s | 366.25s | 31B is ~83x slower
   64 | 119.82 tok/s | 1.38 tok/s | 8.77s | 732.50s | 31B is ~87x slower
   128 | 115.47 tok/s | 1.38 tok/s | 10.40s | 1465.00s | 31B is ~83x slower
   256 | 121.30 tok/s | 1.38 tok/s | 7.62s | 2930.01s | 31B is ~88x slower
  ──────
  ### 🔍 Key Architectural Takeaways

  1. Weight Offload Overhead (PCIe Bottleneck):
      • The Gemma 4 (26B) model quantized to 4-bit occupies ~13GB, fitting comfortably inside the 24GB L4 VRAM without
      offloading. This enables it to run at peak speed, achieving 120+ tokens/second via vLLM's continuous batching.
      • The Gemma 4 (31B) model requires ~16GB. After accounting for runtime memory allocation and KV cache, it
      triggers CPU weight offloading ( --cpu-offload-gb 14 ). Weights must be constantly moved back and forth between
      system RAM and the GPU over the PCIe bus, dropping throughput to ~1.38 tokens/second and increasing base latency
      from 1.99s to 51.57s.
  2. Context and Concurrency Scaling:
      • Gemma 4 (26B) is configured with --max-model-len 4096 and --max-num-seqs 16 , letting it serve context sizes
      up to 4,096 tokens and concurrency up to 16 concurrently without queuing or errors.
      • Gemma 4 (31B) is strictly limited to --max-model-len 512 and --max-num-seqs 4 to prevent VRAM OOM crashes.
      Requests exceeding these bounds either fail instantly with HTTP 400 Bad Request or scale latency linearly to ~2,
      930s under high concurrency due to queuing.

  Scripts used for this analysis:

  • 📊 Comparison Generation: compare_benchmarks.py
  • 📈 Saved comparison plot: comparison_chart.png

So what does all this mean?

  The Gemma 4 (26B) model configuration is indeed fully supported as a distinct, first-class deployment package in the
  workspace under gpu-26B-L4-devops-agent.

  With that context, the 26B configuration works significantly better for single-GPU NVIDIA L4 deployments.

  Here is the breakdown of why it is the superior choice:

  ### 1. 90x - 125x Higher Throughput (VRAM Serving)

  Because the 26B model fits entirely inside the L4's 24GB VRAM, it runs at full speed without needing CPU offloading:

  • 26B: 39.0 - 126.0 tokens/second
  • 31B: 0.3 - 1.38 tokens/second

  ### 2. Sub-Second to Single-Digit Latencies

  • 26B: Average response latency stays between 1.9s and 9.5s across all concurrency levels.
  • 31B: Latency is 45.78s at low load and scales to 2930.01s (50 minutes) under queue pressure.

  ### 3. Full 4,096 Token Context Window

  • 26B: Fully supports a 4,096 token context window (with --max-model-len=4096 ), allowing it to ingest actual system
  log snippets and traces.
  • 31B: Is constrained to --max-model-len=512 , which is too small for most real-world log diagnostics.

  ### Summary Verdict

  For single-GPU (NVIDIA L4) environments, the gpu-26B-L4-devops-agent agent is the better and more viable model,
delivering
  high
  performance and a usable context window. The gpu-31B-L4-devops-agent agent should only be selected if you plan to scale
the
  infrastructure to multi-GPU or TPU clusters.

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment.

A detailed analysis compared the 31B vs the 26B deployment confirmed- that even though the 31B model can be run in the 24GB GPU memory with various techniques- it becomes unstable. In order to use the 31B model- more resources are needed.

26B Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI

xbill — Mon, 01 Jun 2026 19:42:06 +0000

This article provides a step by step deployment guide for Gemma 4 to a Google Cloud Run hosted GPU enabled system,. A suite of Python MCP tools is built to simplify management of the vLLM hosted Gemma 4 deployment with Antigravity CLI.

What is this project trying to Do?

This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

gpu-26B-L4-devops-agent
source init.sh

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

gpu-26B-L4-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/gpu-26B-L4-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

xbill@penguin:~/gemma4-tips/gpu-26B-L4-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files

And a test run:

xbill@penguin:~/gemma4-tips/gpu-26B-L4-devops-agent$ make test
python test_agent.py
2026-05-29 19:07:48,275 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 19:07:48,318 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-05-29 19:07:48,318 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-05-29 19:07:48,319 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-05-29 19:07:48,320 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.024s

OK

MCP stdio Transport

One of the key features that the MCP protocol provides is abstracting various transport methods.

The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-26B-L4-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-east4",
        "VLLM_BASE_URL": "https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-26B-A4B-it"
      }
    }
  }
}

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                       analyze_cloud_logging, +21 more

Getting Started with Gemma 4 on GPU

The Official vLLM repo also has Gemma4 specific information:

Releases · vllm-project/vllm

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

> get_help

  Here is the help documentation and active environment configuration for the GPU vLLM DevOps Agent:

  ### ⚙️ Active Environment Configuration

  • Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • Region ( GOOGLE_CLOUD_LOCATION ): us-east4 
  • Model Name ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • GCS Bucket ( BUCKET_NAME ): aisprint-491218-bucket 
  • vLLM Endpoint ( VLLM_BASE_URL ): https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app 
  • Active Mode: Running in CLOUD RUN mode targeting NVIDIA L4 GPU in us-east4 .
  ──────
  ### 🧰 Available MCP Tools

  #### 🐳 Infrastructure & Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA L4 in us-east4 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks L4 and other GPU quotas for a region.

  #### 📦 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy a model from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download a model from Hugging Face and upload it to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .

  #### 📊 Monitoring & Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  #### 📈 Performance & Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  #### 💬 Interaction & Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
  ──────
  ### 📂 Key Source Code Reference Files

  • MCP Server Implementation: server.py
  • Project Documentation: README.md
  • Agent Test Suite: test_agent.py

Deploy The Model

The MCP server tools are used to deploy the model to Cloud Run with TPU:

deploy the model

Checking System status

The status can be checked with an MCP tool:

> get_system_status

● gpu-devops-agent/get_system_status(System status check) (ctrl+o to expand)

  The current system status of the GPU Cloud Run deployment:

  • vLLM Health: 🟢 Online (gpu-26b-l4-devops-agent https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app)
  • Cloud Run Service Status: 🟢 Ready

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> verify_model_health

● gpu-devops-agent/verify_model_health(Model health verification) (ctrl+o to expand)

  The model health check has passed successfully:

  • Status: ✅ PASSED
  • Model: nvidia/gemma-4-26B-A4B-NVFP4 
  • Response: "Yes, I am working correctly. How can I help you to..."
  • Latency: 0.44 seconds

and model details:

> get_active_models

● gpu-devops-agent/get_model_details(Get model details) (ctrl+o to expand)

  The active model currently served on the endpoint is:

  • Model ID: nvidia/gemma-4-26B-A4B-NVFP4 
  • Owned By: vllm 
  • Health Status: Healthy (🟢 /health check passed)

and the endpoint:

> get_endpoint

● gpu-devops-agent/get_endpoint(Get endpoint URL) (ctrl+o to expand)

  The active vLLM endpoint is:

  🟢 Online: https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app

and status:

> > get_system_details

● gpu-devops-agent/get_help(Get SRE help details) (ctrl+o to expand)

  The active system configuration and details for the DevOps agent are as follows:

  ### ⚙️ Environment Configuration

  • GCP Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • GCP Region ( GOOGLE_CLOUD_LOCATION ): us-east4 
  • GCS Bucket ( BUCKET_NAME ): aisprint-491218-bucket 
  • Model Name ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • vLLM Service URL ( VLLM_BASE_URL ): https://gpu-26b-l4-devops-agent-289270257791.us-east4.run.app

  ### ℹ️ Active Mode

  The agent is running in CLOUD RUN mode, targeting an NVIDIA L4 GPU in the us-east4 region.

Check the Cloud Run Service

Once Deployed — the model is visible as a standard Cloud Run Service:

Review the Model

Check the model details:

> what are the model details?

● gpu-devops-agent/get_model_details(Get model details) (ctrl+o to expand)

  The details of the active model are:

  • Model ID: nvidia/gemma-4-26B-A4B-NVFP4 (Gemma 4 26B model quantized/optimized by NVIDIA)
  • Engine: vllm 
  • Health Status: Healthy ✅

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is gemma

● gpu-devops-agent/query_gemma4_with_stats(Query Gemma 4 and get metrics) (ctrl+o to expand)

  Here is the response from Gemma 4 along with the streaming performance stats:

  ### 📊 Performance Statistics

  • Model: nvidia/gemma-4-26B-A4B-NVFP4 
  • Time to First Token (TTFT): 0.142s 
  • Total Generation Time: 17.671s 
  • Throughput: 39.71 tokens/s 
  • Total Tokens (approx.): 696 
  ──────
  ### 💬 Model Response

  Gemma is a family of lightweight, state-of-the-art open-weights models developed by Google DeepMind.

  While Gemini is Google's massive, closed-source model served via APIs and web interfaces, Gemma is the developer-
  focused sibling designed to run locally on your own hardware (laptops, workstations, or private cloud servers).

  #### Key Highlights:

  1. Open Weights: You can download the model parameters, inspect them, fine-tune them for specific tasks, and run them
  completely offline.
  2. Gemini DNA: Built using the same research, datasets, and infrastructure as Gemini, enabling high performance in
  reasoning, coding, and math for its size.
  3. Lightweight & Efficient: Designed to run on consumer-grade hardware (GPUs/TPUs) without requiring data-center-
  scale infrastructure.
  4. Variants: Includes specialized flavors like CodeGemma (for programming) and PaliGemma (vision-language).

Benchmark the Local Model

The local model was benchmarked:

The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

The analysis is reviewed:

  ### 📊 GPU Benchmark Results

  Model: nvidia/gemma-4-26B-A4B-NVFP4 

   Concurrency | Success Rate | Req/s | Tokens/s | Avg Latency | P95 Latency
  -------------------|-------------------|-------------------|-------------------|------------------|------------------
   1 | 100.0% | 0.50 | 39.02 | 1.99s | 3.29s
   2 | 100.0% | 0.58 | 67.20 | 3.43s | 5.54s
   4 | 100.0% | 1.23 | 126.08 | 3.23s | 4.08s
   8 | 100.0% | 1.28 | 115.32 | 4.83s | 8.01s
   16 | 100.0% | 1.69 | 131.11 | 5.95s | 11.62s
   32 | 100.0% | 1.05 | 114.58 | 9.56s | 19.04s
   64 | 100.0% | 1.24 | 119.82 | 8.77s | 16.11s
   128 | 100.0% | 1.06 | 115.47 | 10.40s | 18.89s
   256 | 100.0% | 1.26 | 121.30 | 7.62s | 15.90s

  Analyzing Benchmark Results
  I have created a comprehensive, deep-dive SRE analysis of the benchmark results and saved it as an artifact.

  You can read the full report here: benchmark_analysis.md

  ### 📈 Key Takeaways from the Analysis

  • Optimal Concurrency Sweet-Spot (16 Users): Peak throughput is achieved at 16 concurrent users yielding 131.11      
  Tokens/s with an acceptable average latency of 5.95s.
  • Hardware Saturation Limit: The single NVIDIA L4 GPU saturates at approximately 120–130 Tokens/s for all concurrency
  levels $\ge 4$. Adding more concurrent load increases queuing latency rather than throughput.
  • Robust Stability: The deployment maintained a 100% success rate even at 256 concurrent users, demonstrating
  excellent queuing and request-scheduling stability under heavy load.

This is visualized:

Comparison with 6000 GPU

 Here is a comparison between the new benchmark results (from our latest sweep up to 256 concurrency) and the previous
  benchmark results documented in benchmark_report.md:

  ### 📊 Comparative Analysis Table

   Metric | Previous Benchmark Result… | New Benchmark Results (our… | Key Differences & SRE Obse…
  ----------------------------|----------------------------|-----------------------------|-----------------------------
   Peak Throughput | 80 to 90 tokens/second (at | 131.11 tokens/second (at 16 | ~45% increase in peak       
                              | smaller context windows) | concurrency) | throughput. This suggests
                              | | | the vLLM engine is running
                              | | | with higher batch sizes or
                              | | | optimized FP8 KV cache
                              | | | utilization compared to the
                              | | | previous run.
   Average Latency (Low | Sub-second (< 1.0s) | 1.99s average (at 1 | Higher base latency in the  
   Concurrency) | average | concurrency) | new run. The new test
                              | | | generated a fixed 128 
                              | | | token output length,
                              | | | whereas the previous
                              | | | benchmark might have had a
                              | | | much smaller target output
                              | | | token length.
   Average Latency (High | Under 1.0s average at 128 | 10.40s average at 128 | Significantly higher high-
   Concurrency) | concurrency | concurrency | load latency. At 128
                              | | | concurrent requests
                              | | | generating 128 tokens each,
                              | | | the GPU queue fills up,
                              | | | increasing average response
                              | | | time to 10.4 seconds.
   Maximum Concurrency Tested | Up to 128 users | Up to 256 users | Successfully tested higher
                              | | | load. The engine handled
                              | | | 256 users without error
                              | | | (100% success rate),
                              | | | maintaining a throughput of
                              | | | 121.30 tokens/s.

  ### 🔍 SRE Takeaway

  The newer serving configuration on the NVIDIA L4 is tuned to maximize aggregate throughput (yielding up to 131.11    
  tokens/s at high concurrency), but this comes at the cost of higher latency per request under concurrent load due to
  the increased execution and queue time for generating up to 128 tokens.

Cost per Token

The current Google Cloud pricing is used to evaluate the cost per token:

  Estimating Compute Costs
  To compare the cost efficiency (cost per million tokens) between the NVIDIA L4 and the NVIDIA RTX PRO 6000           
  (Blackwell) on Google Cloud Run, we can construct an SRE cost model.

  GPU billing on Cloud Run is billed per second, but requires minimum allocations of vCPUs and memory.
  ──────
  ### 1. Hourly Resource Cost Estimates (On-Demand)

   Resource component | NVIDIA L4 Stack | NVIDIA RTX PRO 6000 Stack
  ---------------------------------|-----------------------------------------|-----------------------------------------
   GPU Rate | ~$0.55 / hour | ~$2.50 / hour
   Required vCPUs | 4 vCPU (Min) $\rightarrow$ ~$0.184 / hr | 20 vCPU (Min) $\rightarrow$ ~$0.92 / hr
   Required Memory | 16 GiB (Min) $\rightarrow$ ~$0.08 / hr | 80 GiB (Min) $\rightarrow$ ~$0.40 / hr
   Total Estimated Stack Cost | ~$0.81 / hour ($0.000226/sec) | ~$3.82 / hour ($0.001061/sec)
  ──────
  ### 2. Cost Per Token Comparison (At Peak Load)

  Assuming the L4 runs at the peak performance observed in our sweep (~131 Tokens/s) and estimating the RTX PRO 6000
  Blackwell to run at ~2.5x the throughput (~327 Tokens/s) due to its larger VRAM bandwidth and compute:

   Metric | NVIDIA L4 | NVIDIA RTX PRO 6000
  ---------------------------------------|--------------------------------------|--------------------------------------
   Peak Throughput | 131.11 Tokens/s | ~327.50 Tokens/s (Estimated)
   Hourly Stack Cost | $0.81 / hour | $3.82 / hour
   Cost per Second | $0.000226 | $0.001061
   Cost per Million Tokens | ~$1.72 | ~$3.24
  ──────
  ### 🔍 Cost Analysis & Recommendation

  1. The L4 is the Budget-Friendly Champion:
      • For standard batch sizes and latency tolerance, the NVIDIA L4 is roughly ~47% more cost-effective per token.
      This is because the RTX 6000 requires a mandatory minimum allocation of 20 vCPUs and 80 GB RAM, which
      significantly inflates the baseline stack cost.
  2. When to Choose the RTX PRO 6000:
      • Use the RTX PRO 6000 if you require its 96 GB VRAM capacity for running unquantized/larger model variants (like
      the full Gemma 4 31B/72B) or need extremely low latency/high throughput on very large sequence context tasks
      ($\ge 32k$ tokens).

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot Cloud Run Model deployments.

Local Mac Gemma 4 Deployment with MCP and Antigravity CLI

xbill — Sun, 31 May 2026 19:03:09 +0000

This article provides a step by step deployment guide for Gemma 4 to a M3 Macbook Air. A suite of Python MCP tools is built to simplify management of the Ollama hosted Gemma 4 deployment with Antigravity CLI.

What is this project trying to Do?

This project is a DevOps/SRE assistant that uses a Gemma 4 model self-hosted locally. It provides tools to deploy the model, as well as for observability and performance testing.

This project is similar to a previous project that targeted TPU hosted Gemma4 instances on GCP:

Self-hosted Gemma 4 on TPU with vLLM, MCP, ADK, and Gemini CLI

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

mac-2B-devops-agent
source init.sh

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

mac-2B-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/mac-2B-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

m3:mac-2B-devops-agent xbill$ make lint
ruff check .
All checks passed!
ruff format --check .
10 files already formatted
mypy .
Success: no issues found in 10 source files
m3:mac-2B-devops-agent xbill$

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "local-devops-agent": {
      "command": "python3",
      "args": [
        "/Users/xbill/gemma4-tips/mac-2B-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "MODEL_NAME": "google/gemma-4-E2B-it",
        "LOCAL_VLLM_PORT": "8000",

      }
    }
  }
}

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

m3:mac-2B-devops-agent xbill$ agy

                  Antigravity CLI 1.0.3
                  xbill9@gmail.com (Google AI Ultra)
                  Gemini 3.5 Flash (Low)
                  ~/gemma4-tips/mac-2B-devops-agent

────────────────────────────────────────────────────────────────────────────────
> 
────────────────────────────────────────────────────────────────────────────────
MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> ✓ local-devops-agent Tools: verify_model_health, save_hf_token,
                         manage_docker, get_system_status, get_endpoint, +9 more

Getting Started with Gemma 4 Locally

Ollama automatically leverages your M3 Mac’s unified memory and Metal GPU architecture out-of-the-box. It runs local open-weight models (like Qwen, Llama, and Gemma) at high speeds without the need for discrete VRAM limits. [1, 2, 3, 4, 5]

How to Run Ollama with Metal:

Install Ollama Download the macOS installer directly from Ollama or open your Terminal and run: curl -fsSL https://ollama.com/install.sh | sh.
Pull a Model Choose a model from the Ollama Library and pull it in your terminal: ollama pull qwen3:8b
Run the Model Initiate the prompt interface: ollama run qwen3:8b [1, 2, 3, 4, 5]

M3 Hardware Tips:

Model Size: Your model size will depend heavily on your physical unified memory (RAM). For example, a 7B or 8B model will take (\sim 4) GB of RAM, while a 70B model requires (\sim 40) GB. [1, 2]
Unified Memory: Because your M3 CPU and GPU share the same memory pool, large models can be loaded entirely. Ensure you quit other memory-heavy applications to maximize your available token generation speeds. [1, 2, 3]
Avoid Docker: Do not run Ollama inside a Docker container on macOS, as GPU passthrough and Metal acceleration are not natively supported there.

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

> print out mcp tools

  Here is the updated list of available MCP tools for the local Gemma 4 agent:

   Tool Name | Description | Code Location
  ---------------------------|-----------------------------------------------------------------------|-----------------
    analyze_local_logs | Analyzes the local container logs using Gemma 4 to find SRE/DevOps | server.py
                             | errors. |
    get_active_models | Gets the currently loaded models in Ollama's memory. | server.py
    get_docker_logs | Retrieves startup and execution logs from the Docker container. | server.py
    get_endpoint | Checks endpoint connectivity and returns the service URL. | server.py
    get_help | Provides help text and summarizes the configuration options. | server.py
    get_model_show_details | Gets parameters, architecture, license, and config for a model. | server.py
    get_system_details | Retrieves detailed information about the running local model, engine, | server.py
                             | and versions. |
    get_system_status | Displays a dashboard of container and serving health status. | server.py
    manage_docker | Starts, stops, restarts, or removes the container. | server.py
    query_gemma4 | Standard prompt query interface for the local model. | server.py
    query_gemma4_with_stats | Prompt interface returning performance metrics (latency, TTFT). | server.py
    run_benchmark | Runs the benchmark suite or vLLM benchmarking script. | server.py
    save_hf_token | Securely caches a Hugging Face API token. | server.py
    verify_model_health | Runs a deep health check verifying latency and response correctness. | server.py

Checking System status

The status can be checked with an MCP tool:

> get_system_status

● local-devops-agent/get_system_status(Get system status) (ctrl+o to expand)

  ### 🌀 Local System Status

  • vLLM Health: 🟢 Online (http://localhost:8000)
  • Native Ollama Service Status: 🟢 Running (ollama daemon is active via
  Homebrew launch agent)

  Next Steps: You can now interact with the local model using the            
  query_gemma4 tool.

──────────────────────────

Deploy The Model Locally

Once the base environment has been checked- the Gemma model can be deployed:

> deploy gemma4

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> model health check
  • Status: ✅ Model health check PASSED.
  • Model Response: "Hello! Yes, I am working and ready to assist you..."
  • Inference Latency: 4.84 seconds.

  The initial empty response was likely due to a transient cold-start or
  model
  loading delay (the model was active in CPU/GPU memory). The local          
  gemma4:e2b model is now fully healthy and ready for queries.

and model details:

> get_active_models

● local-devops-agent/get_active_models(Get active models) (ctrl+o to expand)

  ### 📊 Active Loaded Models:

    NAME ID SIZE PROCESSOR CONTEXT
  UNTIL
    gemma4:e2b bcec27ff2bea 7.7 GB 76%/24% CPU/GPU 2048 4
  minutes from now

and the endpoint:

> get_endpoint

● local-devops-agent/get_endpoint(Get endpoint) (ctrl+o to expand)

  🟢 Local vLLM/Ollama is Online at: http://localhost:8000

and status:

> get_system_details

● local-devops-agent/get_system_details(Get system details) (ctrl+o to expand)

  ### 🧩 Model Details (http://localhost:8000)

  Model Information ( /v1/models ):

    {
      "object": "list",
      "data": [
 {
          "id": "gemma4:e2b",
   "object": "model",
   "created": 1780234616,
   "owned_by": "library"
 },
        {
          "id": "gemma4:e4b",
   "object": "model",
   "created": 1780234234,
   "owned_by": "library"
 }
      ]
    }

  • Health Status: Healthy ✅

Review the Model

Check the model details:

> Here are the deep parameters and configurations for the loaded model
● local-devops-agent/get_model_show_details(Get model show ...) (ctrl+o to
expand)

  Here are the deep parameters and configurations for the loaded model (     
  gemma4:e2b ):

  ### 🧩 Model Details for gemma4:e2b 

  • Architecture: gemma4 
  • Parameters: 5.1B 
  • Context Length: 131,072 (Configured to 1024 for run)
  • Embedding Length: 1,536 
  • Quantization: Q4_K_M 
  • Capabilities: Completion, Vision, Audio, Tools, Thinking
  • Parameters Config:
      • temperature : 1 
      • top_k : 64 
      • top_p : 0.95 
  • License: Apache License 2.0

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is gemma?

● local-devops-agent/query_gemma4_with_stats(Query Gemma 4 ...) (ctrl+o to
expand)

  ### 📊 Inference Performance Stats

  • Time to First Token (TTFT): 10.912s 
  • Total Generation Time: 28.396s 
  • Throughput: 14.93 tokens/s 
  • Total Tokens (approx.): 261 
  ──────
  ### 💬 Model Response

  "Gemma" primarily refers to a family of lightweight, open-weights large    
  language models (LLMs) developed by Google DeepMind.

  Here is a breakdown of what that means:

  1. Large Language Models (LLMs): Like other LLMs, Gemma models are trained
  on massive amounts of text data, allowing them to understand, generate, and
  process human language.
  2. Open Weights: A key feature of the Gemma models is that they are
  released with open weights. This means researchers, developers, and the
  public have access to the model parameters, allowing them to inspect,
  modify, fine-tune, and build upon the models for specific applications.
  3. Lightweight and Efficient: The models are designed to be efficient,
  meaning they offer strong performance while being relatively lightweight
  compared to some of the largest models. This makes them accessible for
  deployment on a wider range of hardware.
  4. Family: The Gemma family includes various sizes (e.g., different
  parameter counts) to suit different computational needs, ranging from
  smaller, highly efficient models to more capable ones.

Benchmark the Local Model

The local model was benchmarked:

● Create(/Users/xbill/.gemini/an...benchmark_comparison.md) (ctrl+o to expand)

  I have compiled a comparison report between the two benchmark results:

  • 4B Agent File: benchmark_results.csv
  • 2B Agent File: benchmark_results.csv

  The full analysis is saved in the artifact: benchmark_comparison.md.

  ### Summary Comparison

   Metric | Concurrency | 4B Model (m… | 2B Model (ma… | Speedup / Di…
  --------------|-------------|--------------|---------------|---------------
   Avg Latency | 1 | 1.57s | 0.59s | 2.66x faster
   (s) | | | | (2B)
                | 2 | 2.68s | 0.79s | 3.39x faster
                | | | | (2B)
   P95 Latency | 1 | 1.57s | 0.61s | 2.57x faster
   (s) | | | | (2B)
                | 2 | 3.02s | 1.01s | 2.99x faster
                | | | | (2B)
   Req/sec | 1 | 0.64 | 1.70 | +166% (2B)
                | 2 | 0.67 | 1.98 | +195% (2B)
   Tokens/sec | 1 | 20.44 | 27.25 | +33.3% (2B)
                | 2 | 21.57 | 31.70 | +47.0% (2B)

  ### Key Takeaways

  1. Performance: The 2B model is significantly faster across the board,
  processing requests with roughly 3x lower latency and achieving 47% more   
  tokens per second at concurrency 2.
  2. Scalability: While the 4B model shows a sharp latency increase from
  concurrency 1 to 2, the 2B model handles concurrency scaling up to 16 with
  very little performance degradation (latency remaining under ~0.83s).

This is visualized:

Summary

The strategy for using MCP for local Gemma 4 deployment with Antigravity CLI was validated with a incremental step by step approach.

Mix and Match: Running Kiro on Google Cloud Shell

xbill — Sun, 31 May 2026 12:12:24 +0000

This article provides a step by step deployment guide for using Amazon Kiro CLI on the Google Cloud Shell.

What is this project trying to Do?

This project aims to configure the Kiro CLI for seamless cross cloud operations across AWS and Google Cloud.

Aren’t You Running on the Wrong Cloud?

Mix and Match — Baby! The Antigravity CLI runs on AWS Cloud Shell:

AWS Cloud Shell with Antigravity CLI

So let’s give Kiro a shot on Google Cloud Shell!

What is the Google Cloud Shell Terminal and Editor?

The Google Cloud Shell- available on every Google Cloud project — comes with a built-in code editor that allows you to browse file directories, view and edit files, with continued access to the Cloud Shell. The Cloud Shell Editor is available by default with every Cloud Shell instance, and is based on Code OSS.

Where Can I find the Cloud Shell?

The Cloud Shell is provided on the Google Cloud Console for all Google Cloud Projects. It is available from the small icon on the upper right next to the Gemini Icon:

The Cloud Shell Editor is a basic hosted VSCode environment for rapid prototyping. It offers a minimal IDE environment to get started. For more in depth development-other options are available like Firebase Studio or developing from a GCE Cloud VM.

AWS Setup

The AWS CLI provides the basic tools for working with the AWS services:

AWS CLI

Check the AWS installation:

xbill9@cloudshell:~ (aisprint-491218)$ /usr/local/bin/aws --version
aws-cli/2.34.57 Python/3.14.5 Linux/6.6.137+ exe/x86_64.ubuntu.24
xbill9@cloudshell:~ (aisprint-491218)$

Once the tools are installed — login to the AWS console:

aws login --remote

What is Kiro

Kiro is an agentic, AI-powered Integrated Development Environment (IDE) built by Amazon Web Services (AWS) designed to help developers move from initial concepts to production-ready software. Based on a fork of VS Code (Code OSS), Kiro feels familiar while acting like an autonomous co-developer. [1, 2, 3]

Installing Kiro

The main download site for Kiro is:

https://kiro.dev/

Kiro: Bring engineering rigor to agentic development

for the CLI download — click CLI:

curl -fsSL https://cli.kiro.dev/install | bash

This will download Kiro to Google Cloud Shell:

Running Kiro

First — authenticate to Kiro:

Then confirm:

Verify The Installation

Kiro has full access to the local Google Cloud Shell Environment:

 summarize my google cloud project

● Shell gcloud config list 2>/dev/null && echo "---" && gcloud projects describe $(gcloud config get-value project
  2>/dev/null) 2>/dev/null
● Shell echo "=== Enabled APIs ===" && gcloud services list --enabled --format="value(name)" 2>/dev/null | sed
  's/.googleapis.com//' && echo "=== Compute Instances ===" && gcloud compute instances list 2>/dev/null && echo "=== Cloud
  Run Services ===" && gcloud run services list 2>/dev/null && echo "=== GCS Buckets ===" && gsutil ls 2>/dev/null && echo
  "=== Cloud Functions ===" && gcloud functions list 2>/dev/null && echo "=== Billing ===" && gcloud billing projects
  describe aisprint-491218 2>/dev/null
    === Enabled APIs ===
    projects/289270257791/services/addressvalidation
    projects/289270257791/services/aerialview
    projects/289270257791/services/agentregistry
    projects/289270257791/services/aiplatform
    ...+167 lines (ctrl+o to toggle)

  Here's a summary of your Google Cloud project:

  Project

  - Name: aisprint (ID: aisprint-491218)
  - Account: xbill9@gmail.com
  - Created: March 24, 2026 — Active

  Cloud Run Services (6 total, all deployed today/yesterday)

  ┌───────────────────────────┬─────────────┬─────────────┐
  │ Service │ Region │ Status │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-4b-6000-devops-agent │ us-central1 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-4b-l4-devops-agent │ us-east4 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-26b-6000-devops-agent │ us-central1 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-26b-l4-devops-agent │ us-east4 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-31b-6000-devops-agent │ us-central1 │ ✅ │
  ├───────────────────────────┼─────────────┼─────────────┤
  │ gpu-31b-l4-devops-agent │ us-east4 │ ❌ (failed) │
  └───────────────────────────┴─────────────┴─────────────┘

  The naming pattern suggests GPU-accelerated devops agents in two GPU tiers (NVIDIA L4 and RTX 6000) across three model
  sizes (4B, 26B, 31B parameters).

  Storage

  - aisprint-491218-bucket
  - aisprint-491218-source-bucket
  - aisprint-491218_cloudbuild

  No active Compute instances or Cloud Functions.

  Enabled APIs — heavily focused on AI/ML:

  - AI Platform, Vertex AI (aiplatform), Generative Language, Gemini, ML
  - Maps Platform (full suite), BigQuery, Cloud Run, Cloud Build
  - Secret Manager, Pub/Sub, Dataflow, GKE, Artifact Registry
  - Google Workspace APIs (Gmail, Drive, Calendar, Chat) + MCP variants

  The project is clearly an AI agent platform, running GPU-backed LLM inference services on Cloud Run with GPU support.

Test The AWS Integration

This step tests the Kiro interactions with the remote AWS environment:

 Compute

  - 1 EC2 instance: i-03456018d00d13504 (t3.small, stopped)
  - No active ECS clusters or EKS clusters running (stacks exist, clusters likely deleted)

  Lambda

  - AgentCore-mcpBedrockRustA-ApplicationAgentMcpServe-rWGgAcNdRUnH

  S3 Buckets (5)

  - bedrock-agentcore-runtime-... (us-west-2)
  - cdk-hnb659fds-assets-... (CDK bootstrap)
  - elasticbeanstalk-us-east-1-...
  - my-adk-comic-artifacts
  - xbill-test

  CloudFormation Stacks (13 active)

  - Multiple AgentCore-* stacks (Bedrock, ADK, MCP variants)
  - eksctl-* stacks for 3 EKS clusters (mcp-eks, adkui-eks, biometric-scout) — infrastructure stacks remain even if clusters
  are torn down
  - CDKToolkit bootstrap

  ECR Repositories (33)

  - Heavy focus on AI agent workloads: course creator pipelines (orchestrator, researcher, content-builder, judge), MCP
  servers, ADK agents, biometric-scout
  - Both Python and Rust MCP server variants

  No active: RDS, EKS clusters, ECS, Elastic Beanstalk environments

  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  The environment is clearly an AI agent development platform — experimenting with Bedrock AgentCore, ADK (Agent Development
  Kit), and MCP (Model Context Protocol) server patterns, with both Python and Rust implementations.

So Why? Just Why? Why Mix and Match Cloud Setups?

Kiro was used to summarize the advantages of the approach:

  summarize the benefits of using Kiro from the Google Cloud Shell

  Based on your specific environment — an AI agent developer working across GCP and AWS from Cloud Shell — here's where Kiro
  adds real value:

  Cross-Cloud Visibility in One Place

  You just saw this: Kiro can summarize your GCP project and AWS environment in a single conversation, correlating context
  across both clouds without switching tools or consoles.

  Codebase Intelligence

  Cloud Shell gives you a terminal; Kiro gives you a coding partner. For your agent repos (ADK, MCP servers, Bedrock
  AgentCore), Kiro can read, understand, and modify code across files — not just run commands.

  Infrastructure as Code Assistance

  With 13 CloudFormation stacks and CDK in use, Kiro can help write, debug, and refactor IaC — understanding the intent
  behind stacks like eksctl-adkui-eks-cluster rather than just listing them.

  Agentic Workflows Without Context Switching

  You're building AI agents. Kiro can help you design, implement, and deploy them — writing the Cloud Run service configs,
  Dockerfiles, Lambda handlers, and Bedrock agent definitions — all from the same shell session where you're already
  working.

  Persistent Task Execution

  Unlike a plain terminal, Kiro tracks multi-step tasks, catches errors, and adapts — useful when deploying across regions
  (your us-central1 / us-east4 Cloud Run pattern) or debugging failed services like gpu-31b-l4-devops-agent.

  No Extra Tooling

  Cloud Shell already has gcloud, aws, docker, git, and python — everything Kiro needs. There's nothing to install; you get
  AI-assisted development on top of the environment you already have.

Summary

The Google Cloud Shell and Cloud Shell Terminal provides an environment for rapid prototyping Agentic development. With some configuration and customization- it provides core functionality cross cloud using the Kiro CLI.

Untrusted Code, Trusted Cluster Scaling Secure AI Agent Workspaces with GKE Agent Sandbox

Saurabh Mishra — Sun, 31 May 2026 04:03:42 +0000

How gVisor-powered sandbox isolates AI-generated code at the kernel level and why it changes everything for multi-tenant agentic systems.

In this article we are going discuss on below points

The problem with AI agents writing code
What is GKE Agent Sandbox?
How gVisor intercepts the kernel
Architecture deep dive
Setting it up: step by step
Production patterns
Conclusion

There's a moment every engineer running AI agents eventually faces: an LLM generates a perfectly plausible subprocess.run() call, pipes it to bash -c, and realise that one prompt injection away from a full container escape. The code looks reasonable. The agent trusts itself. And cluster's blast radius just became everyone's problem.

This is the defining security problem of the agentic era. Language models don't just generate text anymore they write, execute, and iterate on code in tight feedback loops. The capabilities that make them useful (unrestricted Python, shell access, file I/O) are exactly the capabilities that make them dangerous in a shared cluster.

Google's answer — GKE Agent Sandbox

GKE Agent Sandbox is built for agentic workloads that require high-level scale, extensibility, and security. Key benefits include:

Kernel-level isolation: Provides strong, kernel-level isolation for untrusted, LLM-generated code by using built-in GKE features like GKE Sandbox. Agent Sandbox also supports the open source Kata Containers software.

Sub-second provisioning: Offers an out-of-the-box mechanism to provide sandboxes significantly faster than standard Kubernetes Pod scheduling allows (typically <1s).

Cloud-native extensibility: Leverages the power of the Kubernetes paradigm and the managed infrastructure of GKE.

By providing a declarative, standardized API, GKE Agent Sandbox offers a single-container experience that provides isolation and persistence characteristics similar to a virtual machine (VM), built entirely on Kubernetes primitives

The problem with AI agents writing code

Agentic AI systems whether you're building with LangGraph, AutoGen, Claude's tool-use API, or rolling your own share a common architectural pattern: the model generates code, a runtime executes it, results flow back to the model, and the loop continues. At each iteration, the model has broader context about what worked and what didn't. This is enormously powerful for automating complex tasks.

It also creates an attack surface that traditional Kubernetes security was never designed to handle.

Container escape

LLM-generated code exploits known kernel vulnerabilities or misconfigured capabilities to break out of the container boundary.

Prompt injection via code output

Malicious content in retrieved data embeds instructions that manipulate the agent into executing attacker-controlled payloads.

Lateral network movement

An agent with network access can enumerate internal services, extract credentials, and pivot across your cluster — all through legitimate-looking Python requests.

Filesystem exfiltration

Without mount restrictions, agents can read service account tokens, Kubernetes secrets mounted as volumes, and host path data.

Standard container security — securityContext, network policies, Pod Security Admission provides defence in depth but doesn't address the fundamental issue: containers share the host kernel. If the kernel has a vulnerability, a sufficiently motivated attacker (or sufficiently capable LLM) can exploit it regardless of namespace isolation.

What is GKE Agent Sandbox?
GKE Agent Sandbox is a Google-managed node pool configuration that applies gVisor-based container sandboxing specifically tuned for agentic AI workloads.

At its core, it combines three things:

gVisor runtime (runsc) as the default OCI runtime
Every container in the sandbox node pool runs under runsc instead of the standard runc. This intercepts all syscalls through a user-space kernel implementation called Sentry.

Agent-specific resource isolation profiles
Pre-configured seccomp and AppArmor profiles optimised for Python/Node.js/container-in-container workloads that AI agents commonly generate. No manual tuning of syscall allowlists required for standard use cases.

Integrated observability via Cloud Monitoring
Syscall audit logs, sandbox violation events, and resource consumption metrics flow automatically into Cloud Monitoring — giving you behavioural baselines for agent workloads without custom instrumentation.

How gVisor intercepts the kernel

Understanding what gVisor actually does is essential for reasoning about its security guarantees. The mental model most engineers have of containers — "a process with namespaces and cgroups" — breaks down when thinking about gVisor.

In a standard container, your application's open(), read(), execve(), and socket() calls go directly to the host Linux kernel via the system call interface. The kernel has to handle them, which means a kernel vulnerability is reachable from inside the container.

With gVisor, those same syscalls are intercepted by Sentry a Go implementation of the Linux kernel that runs entirely in user space. Sentry implements the Linux ABI from scratch. When your agent code calls execve(), it's Sentry that handles it, not the host kernel. Sentry then makes a much smaller set of calls to the actual host kernel (through a restricted interface called the "platform") to handle things like memory mapping and scheduling.

End-to-End Architectural Blueprint

To isolate untrusted code execution while maintaining a highly responsive management plane, the architecture splits the cluster into two distinct, specialized node pools.

Standard Node Pool (The Brain)- This pool runs your trusted, long-lived orchestration services. Because this code is written and audited by your team, it runs on the standard Linux host kernel for maximum performance and native access to internal cluster resources.Agent Controller: The core engine managing the life cycle of AI agent tasks, spin-up times, and state tracking.Tool Router: Mediates external API calls and manages what capabilities (e.g., web search, database querying) are exposed to the agent.Result Collector: Aggregates outputs, logs, and state changes from the runtime pods.State & Storage (Postgres/Redis): Highly available data layers tracking session memory and agent state.

Agent Sandbox Node Pool (The Muscle) - This pool is dedicated entirely to executing untrusted code generated by AI models. It uses the runtimeClassName: gvisor configuration to enforce strict kernel-level isolation.Code Executor Pods ($N$ Pods): Ephemeral, rapid-churn pods designed to spin up, run a specific snippet of generated code, and terminate.The Sentry (User-Space Kernel): gVisor’s core component. Instead of letting a Python agent talk directly to the host Linux kernel via standard system calls (syscall()), the Sentry intercepts them. It implements a core suite of Linux kernel primitives in user-space, shielding the host bare-metal or VM infrastructure from container escape vulnerabilities.

Workload Identity & RBAC Separation

By separating Kubernetes Service Accounts (KSAs) and mapping them to distinct Google Cloud IAM Service Accounts, we eliminate the risk of privilege escalation if an agent is compromised.

Observability and Behavioral Analysis

Because sandbox runtimes are naturally adversarial, observability shifts from standard application performance monitoring (APM) to real-time behavioral and security auditing

Syscall Audit Logs: gVisor provides structural logs of intercepted system calls via its internal logging mechanisms. Unusual system calls (e.g., attempts to call forbidden network protocols or direct raw socket manipulations) are immediately streamed to Cloud Logging.

Violation Events: Any attempt by a sandboxed container to bypass the Sentry or execute an invalid operation triggers an immediate containment event, surfaced directly in Google Cloud Security Command Center.

Cloud Monitoring: Aggregates container-level metrics (CPU, Memory, Churn rate). Crucial for detecting malicious infinite loops or resource-exhaustion (DDoS) attempts disguised as AI agent tasks.

Cloud Trace: End-to-end distributed tracing maps exactly how long a request spends routing through the Tool Router versus how long it spends executing inside the gVisor sandbox, allowing you to fine-tune the performance overhead introduced by user-space context switching.

Setting it up: step by step

Here's a complete walkthrough from a fresh GKE cluster to a running sandboxed agent workload. This assumes you have gcloud, kubectl, and Terraform configured for project.

Production patterns

Pattern 1: Warm pool with pre-forked executors

Cold-starting a new pod for every code execution adds latency. The standard pattern is to maintain a pool of warm executor pods that listen for work over a task queue (Pub/Sub or Redis Streams). The controller dispatches code snippets to idle executors; completed executors reset their environment and return to the pool. A garbage collection sidecar restarts pods that have been warm too long to prevent state accumulation.

Pattern 2: Execution budget enforcement

AI agents can get into infinite loops. Beyond Kubernetes resource limits, apply an application-level timeout using Python's signal.alarm or Go's context cancellation. A 30-second wall-clock timeout with a 10-second CPU-time budget covers almost all legitimate agent code execution patterns while preventing runaway loops from consuming pool capacity.

Pattern 3: Network egress allow-listing per agent type

Different agent personas have different legitimate network needs. A data analysis agent needs access to BigQuery and GCS. A web research agent needs HTTP egress to public internet. A code review agent needs neither. Model this with separate NetworkPolicies per agent label, and use PodSpec labels to bind agents to the right policy at scheduling time.

Conclusion

The agentic era is here, and it runs on code execution. Whether you're building autonomous research assistants, DevOps automation agents, or data pipeline orchestrators,eventually going to need a principled answer to the question: what happens when the model writes something it shouldn't?

GKE Agent Sandbox doesn't make the threat go away. Prompt injection is still a model-level problem. Lateral movement still requires complementary network controls. Secrets management still requires RBAC discipline. But the sandbox answers a specific, hard question — what if agent-generated code exploits a kernel vulnerability or escalates privileges? — with a credible, production-tested answer: it runs against Sentry, not your host kernel.

For most teams running agentic workloads on GKE, the operational cost is low (a single node pool configuration), the performance cost is acceptable (single-digit percentages for typical agent workload patterns), and the security benefit is significant (kernel-level isolation with full Kubernetes observability).

That's the architectural question GKE Agent Sandbox is designed to answer. Build agentic systems with the assumption that the code will sometimes be wrong, sometimes be manipulated, and occasionally be malicious and design your execution environment accordingly.

References and Documentation

https://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox

https://docs.cloud.google.com/kubernetes-engine/docs/concepts/machine-learning/agent-sandbox

26B Gemma 4 Deployment with NVIDIA Blackwell 6000, MCP, Cloud Run, and Antigravity CLI

xbill — Sat, 30 May 2026 17:27:07 +0000

What is this project trying to Do?

This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

gpu-26B-6000-devops-agent
source init.sh

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

gpu-26B-6000-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/gpu-6000-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

xbill@penguin:~/gemma4-tips/gpu-26B-6000-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files

And a test run:

xbill@penguin:~/gemma4-tips/gpu-26B-6000-devops-agent$ make test
python test_agent.py
2026-05-29 18:09:59,458 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 18:09:59,501 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-05-29 18:09:59,501 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-05-29 18:09:59,502 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-05-29 18:09:59,503 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.024s

OK

MCP stdio Transport

One of the key features that the MCP protocol provides is abstracting various transport methods.

The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-26b-6000-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-26B-6000-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-central1",
        "VLLM_BASE_URL": "https://gpu-26b-6000-devops-agent-wgcq55zbfq-uc.a.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-26B-A4B-it"
      }
    }
  }
}

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-26b-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                                analyze_cloud_logging, +19 more

Getting Started with Gemma 4 on GPU

The Official vLLM repo also has Gemma4 specific information:

Releases · vllm-project/vllm

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

> get_help

● gpu-26b-6000-devops-agent/get_help(Get help information) (ctrl+o to expand)

  Here is the help and active configuration information for the Gemma 4 DevOps/SRE Agent:

  ### ⚙️ Active Environment Configuration

  • GCP Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • GCP Region ( GOOGLE_CLOUD_LOCATION ): us-central1 
  • GCS Model Bucket ( BUCKET_NAME ): aisprint-491218-bucket 
  • Model Repo/Path ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • Cloud Run URL ( VLLM_BASE_URL ): https://gpu-26b-6000-devops-agent-wgcq55zbfq-uc.a.run.app
  • Mode: Cloud Run mode (targeting NVIDIA RTX PRO 6000 GPU in us-central1 ).
  ──────
  ### 🧰 Available MCP Tools

  #### 🐳 Infrastructure & Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.

  #### 📦 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy models from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download from Hugging Face and upload to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .

  #### 📊 Monitoring & Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  #### 📈 Performance & Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  #### 💬 Interaction & Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.

> list mcp tools

  Here is the list of available MCP servers and their corresponding tools:

  ### 1. gpu-26b-6000-devops-agent (GPU 26B DevOps Agent)

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.
  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy models from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download from Hugging Face and upload to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .
  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.
  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.
  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
  • get_help : Gets help information, environment configurations, and active mode details.

Deploy The Model

The MCP server tools are used to deploy the model to Cloud Run with TPU:

deploy the model

  ### Deployment Status Summary

  The Gemma 4 26B model has been successfully deployed to Google Cloud Run in us-central1 using the NVIDIA RTX 6000 (
  nvidia-rtx-pro-6000 ) GPU accelerator!

  Here is the current state of the deployments:

  1. Active & Serving Revision ( vllm-gemma-4-26b-a4b-it-00003-zfm ) — 100% Traffic:
      • Status: Healthy & Active (Successfully booted and serving inference).
      • GPU Accelerator: nvidia-rtx-pro-6000 (NVIDIA L4 RTX 6000)
      • Resources: 20 vCPUs, 80 GiB memory, 1 GPU.
      • vLLM Configuration: Loaded weights directly from Hugging Face using the HF_TOKEN secret. The startup,
      initialization, and Triton attention backend compilations completed successfully.
      • Service Endpoint: https://vllm-gemma-4-26b-a4b-it-289270257791.us-central1.run.app 
  2. GCS FUSE Revision ( vllm-gemma-4-26b-a4b-it-00004-2nk ) — Provisioning:
      • Status: Loading Safetensors Shards (Mounting from aisprint-491218-bucket/gemma-4-26B-A4B-it/ ).
      • We successfully cached the model weights to the GCS bucket ( gs://aisprint-491218-bucket/gemma-4-26B-A4B-it/ )
      and deployed using GCS FUSE volume mounting.
      • Note: Loading via GCS FUSE is experiencing slower shard loading speeds due to GCS FUSE random access ( mmap )
      latency overhead. Since Revision 00003-zfm is fully ready and serving 100% of the traffic, you can query the
      endpoint immediately.

Checking System status

The status can be checked with an MCP tool:

> check_status

● gpu-26b-6000-devops-agent/get_system_status(System status check) (ctrl+o to expand)

  The system is running and fully operational:

  • vLLM Health: 🟢 Online ( https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app )
  • Cloud Run Service Status: 🟢 Ready

  You can interact with the model using query_gemma4 .

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:


 > verify_model_health

● gpu-26b-6000-devops-agent/verify_model_health(Model health verification) (ctrl+o to expand)

  The model health check has passed successfully:

  • Status: ✅ PASSED
  • Target Model: gpu-26b-6000-devops-agent 
  • Latency: 0.17 seconds
  • Model Response: "Yes, I am working! How can I help you today?..."

and model details:

> get_model_details

● gpu-26b-6000-devops-agent/get_model_details(Model details retrieval) (ctrl+o to expand)

  Here are the details for the currently active model:

  ### 🧩 Model Details

  • Endpoint: https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app 
  • Health Status ( /health ): Healthy ✅

  #### Model Information ( /v1/models )

    [
      {
        "id": "gpu-26b-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

and the endpoint:

> get_endpoint

● gpu-26b-6000-devops-agent/get_endpoint(Endpoint retrieval) (ctrl+o to expand)

  The Cloud Run vLLM endpoint is:

  🟢 https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app

and status:

> get_system_details

● gpu-26b-6000-devops-agent/get_help(Agent help query) (ctrl+o to expand)

  Here are the active configuration and system details:

  ### ⚙️ System Configuration

  • Active Mode: Cloud Run (NVIDIA RTX PRO 6000 GPU) in region us-central1 
  • GCP Project ID ( GOOGLE_CLOUD_PROJECT ): aisprint-491218 
  • GCP Region ( GOOGLE_CLOUD_LOCATION ): us-central1 
  • GCS Bucket Name ( BUCKET_NAME ): aisprint-491218-bucket 
  • Model Path ( MODEL_NAME ): /mnt/models/gemma-4-26B-A4B-it 
  • vLLM Endpoint URL ( VLLM_BASE_URL ): https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app 

  ### 📊 Deployment Status

  • vLLM Health: 🟢 Online
  • Cloud Run Service Status: 🟢 Ready

Check the Cloud Run deployment status:

> show the cloud run docker status

● ListDir(/home/xbill/gemma4-tips/gpu-26B-6000-devops-agent)
● Bash(gcloud run services describe gpu-26b-6000-devops-agent --region us-central1 --format="yaml") (ctrl+o to expand)

  Here is the detailed status and configuration of the Docker container running on Cloud Run:

  ### 🐳 Docker & Container Configuration

  • Docker Image: vllm/vllm-openai:latest 
  • Target Port: 8080 (HTTP)
  • Service Account: 289270257791-compute@developer.gserviceaccount.com 

  #### ⚙️ Resources & Limits

  • CPU Limit: 20 CPUs
  • Memory Limit: 80Gi 
  • GPU Limit: 1 (NVIDIA RTX PRO 6000 via nvidia-rtx-pro-6000 node selector)
  • Container Concurrency: 16 
  • Timeout: 3600 seconds (1 hour)

  #### 📦 Volume Mounts

  • Driver: gcsfuse.run.googleapis.com (GCS FUSE)
  • Bucket: aisprint-491218-bucket 
  • Mount Path: /mnt/models (mounted read-only)

  #### 🚀 Execution Command

    bash -c "vllm serve /mnt/models/gemma-4-26B-A4B-it \
      --served-model-name gpu-26b-6000-devops-agent \
      --enable-log-requests \
      --enable-chunked-prefill \
      --enable-prefix-caching \
      --generation-config auto \
      --enable-auto-tool-choice \
      --tool-call-parser gemma4 \
      --reasoning-parser gemma4 \
      --dtype bfloat16 \
      --quantization fp8 \
      --kv-cache-dtype fp8 \
      --max-num-seqs 8 \
      --gpu-memory-utilization 0.95 \
      --tensor-parallel-size 1 \
      --load-format runai_streamer \
      --port 8080 \
      --host 0.0.0.0 \
      --max-model-len 32767"

Check the Cloud Run Service

Once Deployed — the model is visible as a standard Cloud Run Service:

Review the Model

Check the model details:

> what are the model details?

> what are the model details?

● gpu-26b-6000-devops-agent/get_model_details(Model details retrieval) (ctrl+o to expand)

  The model details are as follows:

  • Serving Endpoint: https://gpu-26b-6000-devops-agent-289270257791.us-central1.run.app 
  • Model ID (served name): gpu-26b-6000-devops-agent 
  • Owner: vllm 
  • Engine Health Status: Healthy ✅
  • Served Model Source Path: /mnt/models/gemma-4-26B-A4B-it (loaded from the bucket aisprint-491218-bucket via GCS
  FUSE).

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is gemma?
● gpu-26b-6000-devops-agent/query_gemma4_with_stats(Query model with stats) (ctrl+o to expand)

  Here are the performance stats and the model's explanation of Gemma:

  ### 📊 Performance Stats

  • Model: gpu-26b-6000-devops-agent 
  • Time to First Token (TTFT): 0.123s 
  • Total Generation Time: 4.329s 
  • Throughput: 169.07 tokens/s 
  • Total Tokens (approx.): 711 
  ──────
  ### 💬 Model Response

  Gemma is a family of lightweight, state-of-the-art open models developed by Google DeepMind.

  If you are familiar with Google’s Gemini (their massive, closed-source AI), you can think of Gemma as the "smaller,
  open sibling." While Gemini is built to power massive consumer applications, Gemma is built for developers,
  researchers, and enthusiasts to run on their own hardware.

  Here is a breakdown of what makes Gemma significant:

  #### 1. "Open Weights" vs. "Open Source"

  It is technically more accurate to call Gemma an open weights model rather than fully "open source."

  • What this means: Google has released the "brain" (the weights/parameters) of the model. This allows you to download
  it, run it on your own laptop or private server, and fine-tune it for specific tasks (like medical coding or creative
  writing).
  • The Difference: You can use it freely, but you don't necessarily have access to the raw training data or the exact
  proprietary training code used by Google.

  #### 2. Built from Gemini Technology

  Gemma was built using the same research, technology, and infrastructure used to create the Gemini models. This means
  that even though Gemma is much smaller, it inherits much of the "reasoning" capability, mathematical ability, and
  linguistic nuance found in Google's most powerful AI.

  #### 3. Key Characteristics

  • Lightweight and Efficient: Unlike massive models that require a room full of servers, Gemma is designed to be small
  enough to run on a high-end consumer laptop or a single GPU.
  • Variety of Sizes: Gemma is released in different sizes (measured in "parameters," such as 2B, 7B, 9B, or 27B).
      • A 2B (2 billion) model is tiny and can run on a phone or a basic laptop.
      • A 27B (27 billion) model is much more intelligent but requires more powerful hardware.
  • Multimodal Capabilities: Newer versions (like PaliGemma) are designed to understand both text and images.

  #### 4. Why does Gemma exist? (The Purpose)

  Google released Gemma to support the global developer community. By providing Gemma, they are:

  • Empowering Developers: Allowing people to build specialized AI apps without paying massive API fees to Google or
  OpenAI.
  • Research: Enabling scientists to study how LLMs (Large Language Models) work, which is harder to do with "closed"
  models like GPT-4.
  • Customization: Allowing a company to take Gemma and "teach" it their specific private data without that data ever
  leaving their secure servers.

  #### Summary Table

   Feature | Gemini | Gemma
  -------------------------------|----------------------------------------|--------------------------------------------
   Access | Closed (via API/App) | Open (Downloadable weights)
   Size | Massive (Trillions of parameters) | Small to Medium (Billions)
   Hardware | Google's Supercomputers | Laptops, Desktops, Private Clouds
   Best For | General purpose, high-complexity tasks | Specific tasks, local privacy, development

Benchmark the Local Model

The local model was benchmarked:

The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

I have completed the benchmark sweep and generated a full performance report.

The benchmark tested 1 to 256 concurrent users across context windows ranging from 8 to 16,384 tokens on the vLLM
  Cloud Run instance (NVIDIA RTX 6000 GPU).

### 📈 Benchmark Results Summary

• Continuous Batching Scaling: The RTX 6000 GPU handles high concurrency remarkably well. For a standard 512-token   
  context, aggregate throughput scaled from 138.52 tokens/s (1 user) to 1,136.26 tokens/s (256 concurrent users).
  • Large Context Performance (16,384 tokens): Even under maximum context sizes (16k tokens) and extreme concurrency
  (256 users), the model achieved an aggregate throughput of 849.39 tokens/s with a 100% success rate.
  • Low Latency: Average latency stays under 2.2 seconds for up to 16 concurrent users across all context lengths.

The analysis is reviewed:

  Here is a hardware-focused Deep Dive comparing the NVIDIA RTX 6000 (96GB/48GB VRAM) and NVIDIA L4 (24GB VRAM) for
  serving Gemma 4 models.
  ──────
  ### 📊 GPU Performance Comparison Chart

  🖼 Image: NVIDIA RTX 6000 vs NVIDIA L4 Performance Comparison → /home/xbill/.gemini/antigravity-cli/brain/2d9e479f-
  08d1-
  4cb8-826d-f51e949ae228/gpu_comparison_chart.png
  ──────
  ### 🔍 Architectural Comparison: RTX 6000 vs. NVIDIA L4

   Specification | NVIDIA RTX 6000 | NVIDIA L4
  --------------------------------------|-------------------------------------|----------------------------------------
   VRAM | 96 GB (or 48 GB) GDDR6 ECC | 24 GB GDDR6
   Memory Bandwidth | 960 GB/s | 300 GB/s
   FP32 Compute | 91.1 TFLOPS | 30 TFLOPS
   Tensor Cores | 568 (Gen 4) | 242 (Gen 4)
   Power Consumption | 300W | 72W
   Best Target Model | Gemma 4 26B / 31B (Large models) | Gemma 4 E4B / 9B (Medium/small models)
  ──────
  ### 💡 Key Takeaways from the Deep Dive

  #### 1. Memory Bandwidth Bottlenecks

  • Large Language Models (LLMs) are highly memory bandwidth-bound during the token-generation phase (decoding). Every
  single token generation cycle requires reloading the entire model's weights from high-speed memory into the
  processing cores.
  • The RTX 6000 (960 GB/s) has 3.2x higher memory bandwidth than the NVIDIA L4 (300 GB/s). This means it can fetch
  weights and process individual sequence updates 3 times faster, directly resulting in the extremely high tokens-per-
  second outputs observed in the benchmark.

  #### 2. VRAM Capacity & Batch Size Limits

  • The NVIDIA L4 has only 24 GB of VRAM. To serve a model like Gemma 4 26B, you are forced to use heavy quantization
  (such as FP4 or INT4), which degrades model accuracy.
  • Even with quantization, the L4 will have very little remaining VRAM for the KV cache, capping maximum concurrency
  (often failing or queueing heavily above 8–16 concurrent users).
  • The RTX 6000 (96 GB) allows loading the model with full fidelity (BF16) or using FP8 quantization to leave up to 70
  GB of VRAM dedicated entirely to the KV Cache. This allows for massive concurrent execution up to 256 users at a 16k
  context window with 100% success rate.

  #### 3. Core Compute and Throughput Scaling

  • With 568 Gen-4 Tensor Cores, the RTX 6000 provides massive parallel matrix multiplication capabilities. Under heavy
  concurrency, the GPU can continuous-batch hundreds of sequences together, keeping the Tensor cores highly saturated
  and hitting peak throughput numbers above 1,100 tokens/second.

This is visualized:

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

31B Gemma 4 Deployment with NVIDIA Blackwell 6000, MCP, Cloud Run, and Antigravity CLI

xbill — Fri, 29 May 2026 21:53:40 +0000

What is this project trying to Do?

This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

gpu-31B-6000-devops-agent
source init.sh

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

gpu-31B-6000-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/gpu-31B-6000-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files

And a test run:

xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$ make test
python test_agent.py
2026-05-29 16:36:41,800 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.027s

OK
xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$

MCP stdio Transport

One of the key features that the MCP protocol provides is abstracting various transport methods.

The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-31b-6000-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-central1",
        "VLLM_BASE_URL": "https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-31B-it"
      }
    }
  }
}

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-31b-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                                analyze_cloud_logging, +19 more

Getting Started with Gemma 4 on GPU

The Official vLLM repo also has Gemma4 specific information:

Releases · vllm-project/vllm

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

> list mcp tools

● gpu-31b-6000-devops-agent/get_help(Get help details from agent) (ctrl+o to expand)

  Here is the list of available Model Context Protocol (MCP) tools provided by the gpu-31b-6000-devops-agent server:

  ### 🐳 Infrastructure & Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.

  ### 📦 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in the GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Returns instructions to copy a model from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Returns instructions to download a model from Hugging Face and upload it
  to GCS.
  • get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .

  ### 📊 Monitoring & Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  ### 📈 Performance & Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  ### 💬 Interaction & Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
  • get_help : Retrieves the help menu, active environment configuration, and tool mappings.

Deploy The Model

The MCP server tools are used to deploy the model to Cloud Run with TPU:

  The compilation step and warm-up run have successfully completed:

  • torch.compile took 67.21 seconds.
  • FlashInfer JIT autotuning completed.
  • CUDA graph memory profiling is complete (5/5 CUDA graphs captured).

  The engine is now starting up the OpenAI API router. Port 8080 will be open and the revision will be marked active
  within seconds.

● Bash(make query PROMPT="Hello Gemma 4!") (ctrl+o to expand)

  I have initiated a test query ( make query ) against the new endpoint to verify the vLLM serving stack is fully
  functional and responding to prompts. I'll notify you as soon as the response arrives.

  The deployment is fully complete and functional!

  Here is the final summary:

  • Model Copy: All weights and tokenizer configurations for google/gemma-4-31B-it (totaling 58.28 GiB) were
  successfully transferred from Hugging Face directly to your Google Cloud Storage bucket ( gs://aisprint-491218-      
  bucket/gemma-4-31B-it/ ).
  • Deployment: The Cloud Run GPU service was successfully deployed under the name gpu-31b-6000-devops-agent pointing
  to the newly uploaded 31B model folder.
  • Initialization: The vLLM serving stack initialized, loaded the 31B model using the Model Streamer, successfully
  performed torch.compile graph optimization, and compiled CUDA graphs.
  • Verification: A test query was sent to the deployed Cloud Run endpoint and returned a successful completions
  response structure showing full connectivity and model execution.

  The endpoint is live and serving traffic at:
   https://gpu-31b-6000-devops-agent-289270257791.us-central1.run.app

Checking System status

The status can be checked with an MCP tool:

> get_system_status

● gpu-31b-6000-devops-agent/get_system_status(Retrieve system status details) (ctrl+o to expand)

  The system status has been successfully retrieved:

  ### 🌀 GPU Cloud Run System Status

  • vLLM Health: 🟢 Online (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)
  • Cloud Run Service Status: 🟢 Ready

  Next Step: You can use the query_gemma4 tool to interact with the model.

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> verify_model_health

● gpu-31b-6000-devops-agent/verify_model_health(Verify model health status) (ctrl+o to expand)

  The model health check has passed successfully:

  • Status: ✅ PASSED
  • Model: gpu-31b-6000-devops-agent 
  • Response: "Yes, I am working! How can I help you today?..."
  • Latency: 0.44 seconds

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

and model details:

> get model details

● gpu-31b-6000-devops-agent/get_model_details(Retrieve details about the serving model) (ctrl+o to expand)

  The model details have been retrieved successfully:

  ### 🧩 Model Details (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)

  Model Information ( /v1/models ):

    [
      {
        "id": "gpu-31b-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

  Health Status ( /health ):

  • Status: Healthy ✅

and the endpoint:

> get endpoint

● gpu-31b-6000-devops-agent/get_endpoint(Retrieve the endpoint URL of the serving stack) (ctrl+o to expand)

  The Cloud Run vLLM endpoint is online at:

  🟢 Endpoint URL: https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app

Check the Cloud Run deployment status:

> show the cloud run docker status

● gpu-31b-6000-devops-agent/status_vllm(Get status of vLLM Cloud Run service)
● Search(Search for status_vllm)
● Read(/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py)
● Bash(gcloud run services describe gpu-31b-6000-devops-agent --project=aisprint-491218 --region=us-c...) (ctrl+o to
expand)

  The container (Docker) configuration and status for the Cloud Run service gpu-31b-6000-devops-agent are:

  • Docker Image: vllm/vllm-openai:latest 
  • Port: 8080 (HTTP)
  • Resource Limits:
      • CPU: 20 Cores
      • Memory: 80 GiB
      • GPU: 1 NVIDIA GPU (RTX Pro 6000)
  • Volume Mounts: /mnt/models mounted to model-volume 
  • Startup Probe: Starts checking after 240 seconds, with a 15-second period and 120 failure threshold (designed to
  accommodate large model loading time).
  • Entrypoint / Args:
    bash -c "vllm serve /mnt/models/gemma-4-31B-it \
      --served-model-name gpu-31b-6000-devops-agent \
      --enable-log-requests \
      --enable-chunked-prefill \
      --enable-prefix-caching \
      --generation-config auto \
      --enable-auto-tool-choice \
      --tool-call-parser gemma4 \
      --reasoning-parser gemma4 \
      --dtype bfloat16 \
      --quantization fp8 \
      --kv-cache-dtype fp8 \
      --max-num-seqs 8 \
      --gpu-memory-utilization 0.95 \
      --tensor-parallel-size 1 \
      --load-format runai_streamer \
      --port 8080 \
      --host 0.0.0.0 \
      --max-model-len 32767"

Check the Cloud Run Service

Once Deployed — the model is visible as a standard Cloud Run Service:

Review the Model

Check the model details:

> what are the model details?

● gpu-31b-6000-devops-agent/get_model_details(Retrieve details about the serving model) (ctrl+o to expand)

  The credentials have been refreshed successfully! Here are the model details:

  ### 🧩 Model Details (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)

  Model Information ( /v1/models ):

    [
      {
        "id": "gpu-31b-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

  Health Status ( /health ):

  • Status: Healthy ✅

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is gemma?

● gpu-31b-6000-devops-agent/query_gemma4_with_stats(Query model for Gemma info and return statistics) (ctrl+o to expand)

  Here are the performance statistics and the model's response:

  ### 📊 Performance Stats

  • Model: gpu-31b-6000-devops-agent 
  • Time to First Token (TTFT): 0.093s 
  • Total Generation Time: 15.658s 
  • Tokens per Second: 39.13 tokens/s 
  • Total Tokens (approx.): 609 

  ### 💬 Model Response

  Gemma is a family of lightweight, state-of-the-art open-weights large language models (LLMs) developed by Google     
  DeepMind.

  Think of Gemma as the "smaller, open sibling" of Gemini, Google's most powerful AI. While Gemini is a closed system
  available via API or chat interface, Gemma is designed to be downloaded and run on your own hardware.

  Here is a detailed breakdown of what makes Gemma unique:

  ### 1. "Open-Weights" vs. "Open Source"

  It is important to note that Gemma is open-weights, not fully open-source.

  • Open-Source usually means the entire training data and code are public.
  • Open-Weights means Google provides the "brain" (the trained parameters) for free, allowing developers to run it
  locally and fine-tune it for specific tasks, but the exact training data remains proprietary.

  ### 2. Relationship to Gemini

  Gemma was built using the same research, technology, and infrastructure used to create the Gemini models. This means
  it inherits much of the reasoning, coding, and mathematics capability of the larger Gemini models, but in a package
  that is small enough to fit on a personal laptop or a workstation.

  ### 3. Key Versions

  Gemma typically comes in different sizes to suit different hardware:

  • Gemma 2B: Extremely small; can run on a mobile phone or a basic laptop.
  • Gemma 9B / 27B: Mid-sized models that offer a balance between high intelligence and performance. The 27B version is
  designed to compete with much larger models (like Llama 3) while remaining efficient.

  ### 4. What can you do with Gemma?

  Because you can host Gemma on your own machine, it is popular for:

  • Privacy: Processing sensitive data without sending it to a cloud server.
  • Customization (Fine-Tuning): A developer can take Gemma and "teach" it a specific medical dataset or a company's
  internal documentation to make it an expert in a niche field.
  • Research: Academics use it to study how LLMs work without needing a supercomputer.
  • Edge Computing: Running AI locally on devices that don't have a constant internet connection.

  ### 5. How does it compare to others?

  Gemma is Google's direct competitor to Meta's Llama (e.g., Llama 3) and Mistral AI's Mistral/Mixtral models. All
  three are fighting to be the "standard" model that developers use to build their own AI applications.

  In short: Gemma is Google's high-performance, portable AI model that allows developers to build and run their own AI 
  apps privately and locally.

Benchmark the Model

The model was benchmarked:

The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

  The benchmark sweep across 1 to 256 concurrent users and 8 to 16,384 context tokens has successfully completed.

  ### 📂 Generated Files

  • Performance Report: benchmark_report.md
  • Raw CSV Results: matrix_benchmark_results.csv
  • Performance Chart: rtx_6000_benchmark.png
  • Conversation Artifact: benchmark_report.md

  ### 📈 Key Results Summary

  • Continuous Batching Efficiency: The vLLM engine scales exceptionally well under load. Under maximum load (256      
  concurrent users with 16,384 context window), the token generation rate peaked at ~344.96 tokens/sec.
  • Latency Profile: Latency is highly predictable, starting at sub-second/single-digit latency for lower concurrency
  profiles, scaling linearly to approximately 41.43s for 256 concurrent users at 16,384 tokens.
  • Warmup/Cold Starts: During the transition to the 512 token context sweep, a container restart and graph
  compilation occurred, resulting in temporarily lower throughput before graph caching took over and returned the
  system to optimal speed.

The analysis is reviewed:

  ### 🔑 Key Takeaways

  1. Continuous Batching Efficiency ($C=16$ Sweet Spot):
  Scaling concurrency from 1 to 16 increases total token throughput by almost 8x (jumping from ~38 tok/s to ~300
  tok/s) while average latency only increases slightly (from 3.32s to 5.17s).
  2. Throughput Saturation:
  The RTX 6000 Pro's token throughput caps out at ~310–345 tokens/sec. Pushing concurrency past 32 up to 256 does
  not yield further throughput gains, but instead results in linear queuing latency (latency reaches up to ~54.87s).
  3. Prefix Caching Phenomenon (16,384 tokens):
  Under high load, large context windows (16k tokens) actually completed faster than short context windows (avg latency
  of 41.43s vs 54.87s at 256 concurrency). This is due to vLLM's automatic prefix caching which evaluates the large
  prompt KV cache once and shares it, reducing prefill time to $O(1)$ for subsequent concurrent requests.

  ### 🛠️ SRE Configuration Recommendations

  • Cloud Run Concurrency: Limit instance concurrency to --concurrency=16 or --concurrency=32 to trigger auto-
  scaling to new instances rather than queuing on a single saturated GPU.
  • Prefix Caching: Always run with the --enable-prefix-caching flag enabled to optimize shared prompt contexts (e.g.
  system instructions, repeated log dumps).
  • Startup Delays: Set startup timeouts to at least 240 seconds to accommodate container cold-starts, safetensors
  downloading, and torch.compile graph optimization.

This is visualized:

Compare to TPU Deployment

The same model was deployed on a v6e-4 TPU for comparison:

  I have generated a side-by-side performance comparison between the RTX 6000 Pro GPU and Cloud TPU v6e-4.

  📂 Detailed Comparison Report: gpu_tpu_comparison_report.md
  🖼️ Visual Comparison Chart: gpu_tpu_comparison.png

  ### 📊 Key Performance Differences

  1. Peak Throughput (TPU Dominance):
      • For short contexts (8 tokens), TPU v6e-4 reaches ~1,379 tokens/sec compared to the GPU's ~308 tokens/sec (4.5x 
      faster).
      • For long contexts (16,384 tokens), TPU v6e-4 achieves ~3,206 tokens/sec compared to the GPU's ~344 tokens/sec
      (9.3x faster).
  2. Latency (Sub-Second Responses):
      • TPU latency stays extremely low under medium load (averaging 0.46s at concurrency 16), whereas the GPU averages
      5.17s for the same level.
  3. Architectural Fit:
      • NVIDIA RTX 6000 Pro GPU (Cloud Run): Perfect for cost-sensitive, low-to-medium concurrency apps, or workflows
      with high idle periods (leveraging Cloud Run's scale-to-zero model).
      • Cloud TPU v6e-4: Essential for high-traffic enterprise gateways, real-time streaming, and massive batch log-
      parsing operations where high context and high concurrency are standard.

📂 Architectural Deep Dive Report: gpu_tpu_deep_dive.md

  ### 🔬 Core Hardware Findings

  1. Memory Bandwidth Bottleneck (Decode Phase):
  During token generation (decode), execution is entirely memory-bandwidth bound because the model parameters must be
  read from VRAM for every single token. The TPU v6e-4's HBM2e memory supplies ~6.4 TB/s of bandwidth compared to the
  GPU's 960 GB/s (GDDR6). This 6.6x advantage is why the TPU handles high concurrency without saturating.
  2. Distributed Processing (Tensor Parallelism):
  The TPU v6e-4 config splits the 31B model across 4 chips ( --tensor-parallel-size 4 ) communicating over an ultra-low
  latency Inter-Chip Interconnect (ICI). The RTX 6000 Pro runs on a single card ( --tensor-parallel-size 1 ),
  processing all weights on a single GPU core.
  3. Compiler Efficiency (XLA vs. CUDA):
  TPUs compile the model graph using XLA (Accelerated Linear Algebra), which optimizes matrix operations and minimizes
  operator execution overhead. The GPU relies on CUDA Graphs and torch.compile which, while fast, has higher startup
  graph compilation times and launch latencies.

The results were graphed:

and recommendations generated:

  📂 Recommendations Document: production_recommendations.md

  ### 🚀 Summary of Key Actions

  1. Hardware Strategy:
      • Use RTX 6000 Pro (Cloud Run) for SRE troubleshooting tools, development, and low-traffic applications where
      scale-to-zero saves significant cost.
      • Use Cloud TPU v6e-4 for high-volume enterprise APIs or systems handling massive context windows (16,384+
      tokens) at high concurrency.
  2. Cloud Run Tuning:
      • Set --concurrency=16 or --concurrency=32 to avoid saturating a single GPU, directing excess traffic to auto-
      scaled new instances.
      • Configure --timeout=300 and TCP startup probes to at least 240 seconds to allow for weight streaming and
      graph optimization.
  3. Prefix Caching Alignment:
      • Keep --enable-prefix-caching active.
      • Organize prompts from static (system parameters, database schemas) to dynamic (user query, timestamped logs) to
      maximize prefix cache hit ratios.

Finally — a basic cost analsys:

  I have performed a cost-per-token analysis comparing the NVIDIA RTX 6000 Pro GPU (Cloud Run) to the Cloud TPU v6e-4
  based on GCP's resource pricing models.

  📂 Detailed Cost Analysis Report: cost_per_token_analysis.md

  ### 📊 Cost-per-Million Tokens Comparison (at Concurrency = 16)

  1. Short Context (8 Input Tokens):
      • RTX 6000 Pro GPU: $3.49 per million tokens
      • Cloud TPU v6e-4: $4.99 per million tokens
      • Verdict: GPU is 30% cheaper because the machine-hour cost is lower ($3.76/hr vs $5.50/hr).
  2. Long Context (16,384 Input Tokens):
      • RTX 6000 Pro GPU: $4.24 per million tokens
      • Cloud TPU v6e-4: $0.53 per million tokens
      • Verdict: TPU is 8x cheaper because the TPU v6e-4 leverages HBM2e memory bandwidth to maintain massive
      throughput (2,892 tokens/sec vs 246 tokens/sec on GPU).

  ### 💡 SRE Recommendation

  • Spiky/Ad-Hoc Workloads (e.g. On-demand log diagnostics): Use the GPU on Cloud Run. It supports scale-to-zero,
  meaning you only pay the $3.76/hr rate while processing requests, saving substantial money over a TPU VM which must
  run constantly (charging $132/day even when idle).
  • Continuous/High-Throughput Production Workloads: Use the Cloud TPU v6e-4. Running TPU instances continuously is
  significantly more cost-efficient for heavy text processing and long-context analysis.

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run

xbill — Fri, 29 May 2026 16:47:25 +0000

What is this project trying to Do?

This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

gpu-6000-devops-agent
source init.sh

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

gpu-6000-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/gpu-6000-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

xbill@penguin:~/gemma4-tips/gpu-6000-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files

And a test run:

xbill@penguin:~/gemma4-tips/gpu-6000-devops-agent$ make test
python test_agent.py
2026-05-28 17:33:37,347 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-28 17:33:37,388 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-05-28 17:33:37,388 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-05-28 17:33:37,389 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-05-28 17:33:37,389 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.022s

MCP stdio Transport

One of the key features that the MCP protocol provides is abstracting various transport methods.

The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-6000-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-6000-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-central1",
        "VLLM_BASE_URL": "https://gpu-6000-devops-agent-289270257791.us-central1.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-E4B-it"
      }
    }
  }
}

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                            analyze_cloud_logging, +20 more

Getting Started with Gemma 4 on GPU

Cloud Run GPU is tied to quota limits:

> check cloud run quota in us-central1

This will return the current quotas:

The Cloud Run ( run.googleapis.com ) GPU quotas for us-central1 in your project ( aisprint-491218 ) are as follows:

  ### 1. NVIDIA RTX Pro 6000 GPU Allocation

  • Metric Name: Total Nvidia RTX Pro 6000 GPU allocation without zonal redundancy 
  • Metric Identifier: run.googleapis.com/nvidia_rtx_pro_6000_gpu_allocation_no_zonal_redundancy 
  • Effective Limit: 3000 in us-central1 (due to a producer override set to 3000 ).

You can also use the Antigravity CLI to research a quota increase:

> request RTX Pro 6000 GPUs cloud run quota of 3 in us-central1 with zonal redundancy

● Bash(gcloud alpha services quota --help || gcloud consumer-quota-metrics --help)
● Bash(gcloud alpha services quota create --help)
● Bash(gcloud alpha services quota create --service=run.googleapis.com --consumer=projects/aisprint-4...) (ctrl+o to
expand)

  The request to set a consumer override of 3 for RTX Pro 6000 GPUs with zonal redundancy in us-central1 returned a
  precondition failure:

    ERROR: (gcloud.alpha.services.quota.create) FAILED_PRECONDITION: The consumer override value can only be set
  between 0 to 0.
    COMMON_QUOTA_CONSUMER_OVERRIDE_TOO_HIGH

  ### Explanation

  This error indicates that the project currently has a producer-defined maximum limit of 0 for this regional metric.
  Consequently, self-service quota increases via gcloud overrides are restricted to 0 .

  To obtain this quota, you must request a quota increase (which raises the producer limit) through the Google Cloud
  Console:

  1. Navigate to the IAM & Admin > Quotas page.
  2. Filter for the metric: run.googleapis.com/nvidia_rtx_pro_6000_gpu_allocation .
  3. Select the us-central1 region and click Edit Quotas to submit a formal request to Google Cloud Support.

The Official vLLM repo also has Gemma4 specific information:

Releases · vllm-project/vllm

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

> get mcp tools

● gpu-6000-devops-agent/get_help(MCP Help Retrieval) (ctrl+o to expand)

  Here are the MCP tools available from the gpu-6000-devops-agent server:

  ### 🐳 Infrastructure & Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU.
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Updates min/max instances for scaling.
  • get_vllm_deployment_config : Generates the gcloud deployment command.
  • get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  • check_gpu_quotas : Checks L4 and other GPU quotas for a region.

  ### 📊 Model Management

  • list_vertex_models : Lists models in the Vertex AI Registry.
  • list_bucket_models : Lists model weights in GCS bucket.
  • save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  • get_vertex_ai_model_copy_instructions : Instructions to copy model from Vertex AI Model Garden to GCS.
  • get_huggingface_model_copy_instructions : Instructions to download model from Hugging Face and upload to GCS.
  • get_huggingfacehub_download_path : Resolves local cache path using huggingface_hub .

  ### 📊 Monitoring & Status

  • get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  • get_endpoint : Verifies connectivity and returns the active service URL.
  • get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  • verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  ### 📈 Performance & Benchmarking

  • run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  ### 💬 Interaction & Diagnostics

  • query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  • query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  • query_vllm : Direct text completions querying tool.
  • analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  • analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  • suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.

Deploy The Model

The MCP server tools are used to deploy the model to Cloud Run with TPU:

deploy the model
 ### Deployment Status Summary

The Gemma 4 model has been successfully deployed to Google Cloud Run in us-central1 using the NVIDIA RTX 6000 (
  nvidia-rtx-pro-6000 ) GPU accelerator!

Checking System status

The status can be checked with an MCP tool:

> get_system_status

● gpu-6000-devops-agent/get_system_status(System Status Retrieval) (ctrl+o to expand)

  Here is the current system status for the GPU Cloud Run deployment:

  • vLLM Health: 🟢 Online (https://gpu-6000-devops-agent-289270257791.us-central1.run.app)
  • Cloud Run Service Status: 🟢 Ready

  Next Step: You can use the query_gemma4 tool to interact with the model.

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> verify_model_health

● gpu-6000-devops-agent/verify_model_health(Verify Model Health) (ctrl+o to expand)

  The model health check passed successfully:

  • Status: ✅ PASSED
  • Model: gpu-6000-devops-agent 
  • Response: "Hello! Yes, I am working. I am a large language mo..."
  • Latency: 0.41 seconds

and model details:

> get_active_models

● gpu-6000-devops-agent/get_model_details(Model Details Retrieval) (ctrl+o to expand)

  Here are the active model details retrieved from the endpoint (https://gpu-6000-devops-agent-289270257791.us-        
  central1.run.app):

  ### 🧩 Active Model Info ( /v1/models )

    [
      {
        "id": "gpu-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

  ### 🩺 Health Status ( /health )

  • Status: Healthy ✅

and the endpoint:

> get_endpoint

● gpu-6000-devops-agent/get_endpoint(Endpoint Retrieval) (ctrl+o to expand)

  The active endpoint is online at:

  🟢 https://gpu-6000-devops-agent-289270257791.us-central1.run.app

and status:

> get_system_details

● gpu-6000-devops-agent/status_vllm(vLLM Status Retrieval) (ctrl+o to expand)

  Here are the detailed system and service status details for the gpu-6000-devops-agent deployment on Cloud Run:

  ### ⚙️ Cloud Run Service Conditions & Status

  • Overall Status: Ready ✅
  • Configuration Status: ConfigurationsReady (True) ✅
  • Route Status: RoutesReady (True) ✅
  • Latest Created Revision: gpu-6000-devops-agent-00002-hvw 
  • Direct URL: https://gpu-6000-devops-agent-wgcq55zbfq-uc.a.run.app
  • Last Transition Time: 2026-05-29T14:07:55Z 

────────────────────────────────────────────────────────────
> which gpu is cloud run using

● gpu-6000-devops-agent/get_vllm_deployment_config(Deployment Config Retrieval) (ctrl+o to expand)

  Cloud Run is configured to use the NVIDIA RTX 6000 Ada Generation GPU (configured via --gpu-type=nvidia-rtx-pro-6000 
  with
  1 GPU unit).

Check the Cloud Run deployment status:


> show the cloud run docker status

● Bash(docker ps -a)
● gpu-6000-devops-agent/analyze_gpu_logs(GPU Log Analysis)
● Bash(gcloud run services describe gpu-6000-devops-agent --region us-central1 --
format="value(status.latestReadyRevisio...)
● Bash(gcloud run revisions describe gpu-6000-devops-agent-00002-hvw --region us-central1) (ctrl+o to expand)

  Here is the current Docker/Container configuration and status for the active revision gpu-6000-devops-agent-00002-  
  hvw on
  Cloud Run:

  ### 🐳 Container & Image Information

  • Container Name: vllm-openai-1 
  • Docker Image: mirror.gcr.io/vllm/vllm-                                                                            
  openai@sha256:69cf768308bad3a6fde6ffeffc8ba1f28433752c01e9cb53f57bfaf547cec2e1 
  • Entrypoint Command: bash 
  • Arguments:
    -c vllm serve /mnt/models/gemma-4-E4B-it \
       --served-model-name gpu-6000-devops-agent \
       --enable-log-requests \
       --enable-chunked-prefill \
       --enable-prefix-caching \
       --generation-config auto \
       --enable-auto-tool-choice \
       --tool-call-parser gemma4 \
       --reasoning-parser gemma4 \
       --dtype bfloat16 \
       --quantization fp8 \
       --kv-cache-dtype fp8 \
       --max-num-seqs 8 \
       --gpu-memory-utilization 0.95 \
       --tensor-parallel-size 1 \
       --load-format runai_streamer \
       --port 8080 \
       --host 0.0.0.0 \
       --max-model-len 32767

  ### 🎛 Resource Allocation

  • CPU: 20 (Always allocated)
  • Memory: 80Gi 
  • GPU: 1 unit of NVIDIA RTX 6000 ( nvidia-rtx-pro-6000 )
  • Concurrency: 16 requests per container instance
  • Scale Limits: Min instances: 1 , Max instances: 3 

  ### 💾 Volumes

  • Model GCS Bucket: aisprint-491218-bucket mounted at /mnt/models (Read-only, uid/gid: 1001)

Check the Cloud Run Service

Once Deployed — the model is visible as a standard Cloud Run Service:

Review the Model

Check the model details:

> what are the model details?

● gpu-6000-devops-agent/list_bucket_models(GCS Bucket Model Listing) (ctrl+o to expand)

  The model currently being served is Gemma 4 E4B-it ( gemma-4-E4B-it ).

  Here are the key details of the active model:

  ### 📁 Model Storage & Location

  • Source Path (in Cloud Storage Bucket): gs://aisprint-491218-bucket/gemma-4-E4B-it/ 
  • Local Container Mount Path: /mnt/models/gemma-4-E4B-it 
  • Model Size: The primary weights file ( model.safetensors ) is 15.25 GB (15,251.73 MB).

  ### ⚙️ vLLM serving configuration:

  • Quantization: fp8 
  • KV Cache Dtype: fp8 
  • Max Model Length: 32,767 tokens
  • Tensor Parallel Size: 1 (runs on a single GPU instance)
  • Serving Name (OpenAI endpoint ID): gpu-6000-devops-agent 

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is gemma4?

● gpu-6000-devops-agent/query_gemma4_with_stats(Gemma 4 Stats Query) (ctrl+o to expand)

  Here are the performance stats and response from the model:

  ### 📊 Performance Stats

  • Model: gpu-6000-devops-agent 
  • Time to First Token (TTFT): 0.093s 
  • Total Generation Time: 0.570s 
  • Throughput: 161.29 tokens/s 
  • Total Tokens Generated: ~77

  ### 💬 Model Response

  │ I am Gemma 4.
  │
  │ I am a Large Language Model developed by Google DeepMind. I am an open weights model, which means my underlying
  │ architecture and weights are available for the community to use and build upon.
  │
  │ In essence, I am a sophisticated AI designed to understand and generate human-like text based on the vast amounts
  of
  │ data I was trained on.

Benchmark the Local Model

The local model was benchmarked:

  The matrix benchmark sweep has completed successfully! All 96 execution cells (concurrency levels from 1 to 128 
  across context window sizes from 8 to 16,384 tokens) were swept.

  I've generated a detailed summary and findings report in benchmark_results.md.

  ### 🚀 Key Performance Highlights

  • Stability: 100% Success Rate across all concurrent loads and context sizes.
  • Peak Throughput: Reached ~350–365 tokens/sec at high concurrencies for small context windows (8–32 tokens).
  • Deep Context Scaling: Maintained ~21–25 tokens/sec at the maximum 16,384 context window size under a high load of  
  128 concurrent users.
  • Latency: The average latency stayed under 0.25 seconds even during maximum concurrency/context window testing.

This is visualized:

Compare L4 vs RTX 6000

The model was previously deployed on a Cloud Run L4 here:

Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI

There results were compared:

> compare results to L4 deployment
● Search(Find benchmark data)
● Read(/home/xbill/gemma4-tips/gpu-6000-devops-agent/benchmark_report.md) (ctrl+o to expand)

▸ Thought for 2s, 241 tokens
  Analyzing GPU Performance

● Create(/home/xbill/.gemini/antigravity-cli/brain/45...-41cd-9dea-85246ed1e468/comparison_report.md) (ctrl+o to expand)

  I have analyzed and compared your new NVIDIA RTX PRO 6000 Ada GPU benchmark results against the baseline NVIDIA L4
  GPU deployment results.

  The detailed comparative analysis is saved in the comparison_report.md artifact.

  ### 📊 Performance Summary: NVIDIA L4 vs. NVIDIA RTX PRO 6000

  • Peak Throughput: Increased from 80–90 tokens/sec on L4 to 350–365 tokens/sec on the RTX PRO 6000 (~4x throughput   
  speedup).
  • Average Latency: Dropped from ~1.0 second under load on L4 to under 0.25 seconds on the RTX PRO 6000 (~4x latency  
  reduction).
  • Context Capacity: Doubled from 16,384 tokens to 32,767 tokens (made possible by the RTX 6000's memory headroom
  combined with the configured FP8 model and KV-cache quantization).
  • Long Context Throughput (16k): Remains comparable (~21–25 tokens/sec on RTX 6000 vs. ~25–27 tokens/sec on L4), but
  with much lower latency under high user concurrency.

and a deep dive:

> deep dive into L4 vs RTX PRO 6000 benchmark results

▸ Thought for 1s, 485 tokens
  Analyzing Benchmark Results

● Create(/home/xbill/.gemini/antigravity-cli/brain/45...1cd-9dea-85246ed1e468/benchmark_deep_dive.md) (ctrl+o to expand)

  I have compiled a detailed, systems-level benchmark comparison analyzing how the hardware specifications of the
  NVIDIA
  L4 and NVIDIA RTX PRO 6000 Blackwell translate directly into the observed benchmark behaviors.

  The complete report is available in the benchmark_deep_dive.md artifact.

  ### 🔍 Summary of Key Architectural Explanations

  1. Memory Bandwidth Bottleneck (Autoregressive Decoding):
      • Observation: The L4 baseline flattens out around 80–90 tokens/sec at concurrency levels above 16. The RTX 6000
      Blackwell scales up to 365 tokens/sec at 128 concurrent users.
      • Explanation: The autoregressive decode phase is memory-bandwidth bound. The RTX 6000 Blackwell's 960 GB/s GDDR7
      bandwidth is 3.2x faster than the L4's 300 GB/s, allowing vLLM to process larger batches without memory stalls.
  2. Compute-Bound Prefill (Time to First Token & Latency):
      • Observation: Under high load, average latency on the L4 climbs to ~1.0 second, whereas it stays flat under 0.25
      seconds on the RTX 6000 Blackwell.
      • Explanation: Prompt processing (prefill) is compute-bound. The RTX 6000 Blackwell features 24,064 CUDA cores
      and 1,457 TFLOPS of FP8 performance (vs. 7,424 cores and 242 TFLOPS on L4), computing prompt matrix
      multiplications almost instantaneously.
  3. VRAM Headroom for Deep Context (16k vs. 32k):
      • Observation: The L4 hits a hard limit at a 16,384 context window. The RTX 6000 Blackwell scales up to 32,767   
      tokens with high concurrency.
      • Explanation: Large context windows require massive VRAM pools to store KV-caches. The RTX 6000 Blackwell's 96  
      GB VRAM easily accommodates the FP8-quantized model and deep KV-caches, maintaining a stable throughput of ~24.66
      tokens/sec at the maximum 32,767 limit.

and graphed:

The Final Verdict? Surprise!

A full cost analysis was done:

  Analyzing Compute Costs
  To compare the cost efficiency of the NVIDIA L4 vs. the NVIDIA RTX PRO 6000 for serving the Gemma 4 (4B-it) model, we
  break down Google Cloud Run resource pricing (Tier 1 regions) and measure it against the benchmarked throughput.
  ──────
  ### 1. Google Cloud Run Resource Rates (Tier 1 Regions)

  To utilize GPUs on Cloud Run, the CPU and memory must remain allocated during the instance lifecycle.

   Billing Component | Unit Rate (CPU Always Allocated)
  ----------------------------------------------------------|----------------------------------------------------------
   vCPU | $0.00001800 per vCPU-second
   Memory | $0.00000200 per GiB-second
   NVIDIA L4 GPU (No Zonal Redundancy) | $0.00018670 per GPU-second (~$0.67/hr)
   NVIDIA RTX PRO 6000 | $0.00079170 per GPU-second (~$2.85/hr)
  ──────
  ### 2. Minimum Stack Configurations & Hourly Costs

   Stack Component | 🟢 NVIDIA L4 Configuration | 🚀 NVIDIA RTX PRO 6000 Configuration
  ---------------------------------------|--------------------------------------|--------------------------------------
   GPU | 1x NVIDIA L4 ($0.672/hr) | 1x NVIDIA RTX PRO 6000 ($2.85/hr)
   vCPU Allocation | 4 vCPUs ($0.259/hr) | 20 vCPUs ($1.296/hr)
   Memory Allocation | 16 GiB ($0.115/hr) | 80 GiB ($0.576/hr)
   Total Hourly Cost | $1.047 / hour ($0.000291/sec) | $4.722 / hour ($0.001312/sec)
  ──────
  ### 3. Cost Per Token Comparison (Gemma 4 4B-it)

  Using the benchmark results for both setups, we calculate the cost per 1 Million Tokens:

  #### Scenario A: Peak Throughput (Short Context)

  • 🟢 NVIDIA L4: Peak Throughput of 90 tokens/sec
      • Time to generate 1M tokens: $11,111\text{ seconds}$ (~3.09 hours)
      • Cost per 1M Tokens: $\mathbf{\$3.23}$
  • 🚀 NVIDIA RTX PRO 6000: Peak Throughput of 365 tokens/sec
      • Time to generate 1M tokens: $2,740\text{ seconds}$ (~0.76 hours)
      • Cost per 1M Tokens: $\mathbf{\$3.59}$

  │ [!NOTE]
  │ For peak throughput workloads, the NVIDIA L4 is slightly cheaper (~10% lower cost per token), but the RTX PRO 6000
  │ delivers the tokens 4x faster with 4x lower latency.

  #### Scenario B: Deep Context Boundary (16K Context Window)

  • 🟢 NVIDIA L4: Throughput of 26 tokens/sec
      • Time to generate 1M tokens: $38,462\text{ seconds}$ (~10.68 hours)
      • Cost per 1M Tokens: $\mathbf{\$11.18}$
  • 🚀 NVIDIA RTX PRO 6000: Throughput of 23 tokens/sec
      • Time to generate 1M tokens: $43,478\text{ seconds}$ (~12.08 hours)
      • Cost per 1M Tokens: $\mathbf{\$57.03}$

  ──────
  ### 🏆 Verdict: Which has the best cost per token?

  1. Overall Winner: NVIDIA L4 GPU
  The NVIDIA L4 has the best cost per token across all scenarios. At 16K context windows, it is over 5x more cost-     
  effective than the RTX 6000 because the RTX 6000 is bound by a minimum compute footprint of 20 vCPUs and 80 GiB
  memory, making its base running cost significantly higher ($4.72/hr vs. $1.05/hr) even when throughput scales down at
  long contexts.
  2. Best for Latency/High Concurrency: RTX PRO 6000
  If sub-second latency under high concurrent load is required, the RTX PRO 6000 is only ~11% more expensive per token
  for short contexts while generating tokens 4x faster.

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment. These tools were used to deploy and benchmark the L4 and RTX 6000 GPUS on Google Cloud Run.

The Kubernetes Overkill: Why I Built a "K8s Killer" for Small Environments

Mario Ezquerro — Fri, 29 May 2026 09:48:30 +0000

Let’s be honest: Kubernetes is the undisputed king of orchestration. But if you are running a small to medium-sized environment, deploying K8s is often like using a sledgehammer to crack a nut. It brings massive resource overhead and operational complexity that most projects simply do not need.

Worse yet, despite its size, Kubernetes isn't actually a complete solution out of the box.

If you want proper SRE management—like native observability, health checks, and Service Level Objectives (SLOs)—you have to bolt on a complex stack of external tools. Need Ingress management to expose your services? You'll have to configure and maintain third-party Ingress controllers just to get traffic flowing.

That architectural bloat led me to advocate for a more minimalist approach. I wanted an orchestrator that provided everything necessary for robust deployments without the massive footprint.

Enter Gubernator: built to be the Kubernetes killer for small environments.

What Makes Gubernator Different?
Gubernator (gbnt) is a lightweight, distributed container orchestrator written entirely in Go. It strips away the unnecessary layers of traditional orchestration and bakes the essential operational requirements directly into its core:

Native SRE & Observability: Instead of treating SRE as an afterthought, Gubernator integrates robust health checks and observability (leveraging OpenTelemetry and Prometheus) directly into the orchestrator.

Built-in Ingress Management: No need to wrestle with external controllers. Gubernator handles your ingress routing natively, simplifying how you expose and manage your services.

True Software Minimalism: By using SQLite for data consistency rather than heavy key-value stores, Gubernator maintains a tiny resource footprint while ensuring reliable state management across multiple hosts.

Developer-Friendly Interface: A clean REST API and an intuitive CLI tool get your containers running securely and efficiently.

If you are tired of the Kubernetes learning curve and the overhead of maintaining tools you barely use, it's time to look at alternatives that prioritize simplicity and reliability.

Check out the repository, documentation, and architecture here: Gubernator on GitHub Pages:

https://mario-ezquerro.github.io/gubernator/

I’d love to hear from the community: at what point do you think Kubernetes becomes too big for its own good? Drop your thoughts in the comments!

DEV Community: Google Developer Experts

[Gemini][Agent] Google Managed Agents API

Preamble: The era of hand-rolling your own agent loop is coming to an end

Three Key Core Capabilities

1. Persistent Sandbox + Filesystem

2. Built-in Tools

3. Multi-turn Session Chaining

Two APIs: Agents for Control Plane, Interactions for Data Plane

What Developers Truly Care About: Cost and Integration Cost

Standard Workflow: 4 SDK Calls to Complete an Agent Interaction

Demo Case: LINE Research Planner Bot

Deployment Practice: Commit → Go Live Automatically

Pitfall Log: Five Pre-GA-Specific Issues

Pitfall One: Synchronous Calls → Mysterious RESOURCE_PROJECT_INVALID

Pitfall Two: gsutil in the Sandbox is a Mock (This one is the most insidious)

Pitfall Three: Cloud Run's /healthz is Intercepted by Google Frontend

Pitfall Four: Service Account Needs to actAs Itself for Cloud Tasks OIDC to Sign

Gubernator [the kill ku8s]

SoftwareEngineering #GoLang #SRE #DevOps #ContainerOrchestration #OpenSource #SystemsArchitecture #Minimalism

I Built an Autonomous AI Agent with Google ADK + Gemini That Spots Trends and Drafts Dev.to Articles for Me

What We're Building

Prerequisites

Step 1: Giving the Agent its "Hands" (API Tools)

Tool 1: Finding Rising Articles

Tool 2: Fetching Community Comments

Tool 3: The Ghostwriter

Step 2: Wiring up the Agent

The "Aha!" Moment

Final Thoughts

31B — Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI

What is this project trying to Do?

Antigravity CLI

Testing the Antigravity CLI Environment

Full Installation Instructions

Python MCP Documentation

Where do I start?

Setup the Basic Environment

Model Management Tool with MCP Stdio Transport

Running the Python Code

MCP stdio Transport

Antigravity CLI mcp_config.json

Validation with Antigravity CLI

Getting Started with Gemma 4 on GPU

Model Lifecycle Management via MCP

Deploy The Model

Checking System status

Cross Check The Deployed Model

Check the Cloud Run Service

Review the Model

Benchmark the Local Model

Comparison with 6000 GPU

Summary

26B Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI

What is this project trying to Do?

Antigravity CLI

Testing the Antigravity CLI Environment

Full Installation Instructions

Python MCP Documentation

Where do I start?

Setup the Basic Environment

Model Management Tool with MCP Stdio Transport

Running the Python Code

MCP stdio Transport

Antigravity CLI mcp_config.json

Validation with Antigravity CLI

Getting Started with Gemma 4 on GPU

Model Lifecycle Management via MCP

Deploy The Model

Checking System status

Cross Check The Deployed Model

Check the Cloud Run Service

Review the Model

Benchmark the Local Model

Comparison with 6000 GPU

Cost per Token

Summary

Local Mac Gemma 4 Deployment with MCP and Antigravity CLI

What is this project trying to Do?

Antigravity CLI

Testing the Antigravity CLI Environment

Pitfall One: Synchronous Calls → Mysterious `RESOURCE_PROJECT_INVALID`

Pitfall Two: `gsutil` in the Sandbox is a Mock (This one is the most insidious)

Pitfall Three: Cloud Run's `/healthz` is Intercepted by Google Frontend

Pitfall Four: Service Account Needs to `actAs` Itself for Cloud Tasks OIDC to Sign