xbill for Google Developer Experts

Posted on May 29 • Originally published at xbill999.Medium on May 28

Gemma 4 Deployment with NVIDIA L4, MCP, Cloud Run, and Antigravity CLI

#googlecloudrun #antigravitycli #gemma4 #mcpserver

This article provides a step by step deployment guide for Gemma 4 to a Google Cloud Run hosted GPU enabled system,. A suite of Python MCP tools is built to simplify management of the vLLM hosted Gemma 4 deployment with Antigravity CLI.

What is this project trying to Do?

This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.

This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

gpu-vllm-devops-agent
source init.sh

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

gpu-vllm-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The simplest transport that the SDK supports is the stdio (stdio/stdout) transport — which connects a locally running process. Both the MCP client and MCP Server must be running in the same environment.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/gpu-vllm-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

xbill@penguin:~/gemma4-tips/gpu-vllm-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files

And a test run:

xbill@penguin:~/gemma4-tips/gpu-vllm-devops-agent$ make test
python test_agent.py
2026-05-28 12:34:44,071 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-28 12:34:44,110 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-05-28 12:34:44,110 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-05-28 12:34:44,112 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-05-28 12:34:44,112 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.023s

OK

MCP stdio Transport

One of the key features that the MCP protocol provides is abstracting various transport methods.

The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-vllm-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-east4",
        "VLLM_BASE_URL": "https://vllm-gemma-4-e4b-it-289270257791.us-east4.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-E4B-it"
      }
    }
  }
}

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

MCP Servers

Configured (~/.gemini/antigravity-cli/mcp_config.json)
> ✓ developer-knowledge Tools: search_documents, answer_query, get_documents

Plugins (~/.gemini/antigravity-cli/plugins)
   ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                       analyze_cloud_logging, +21 more

Getting Started with Gemma 4 on GPU

Cloud Run GPU is tied to quota limits:

> check cloud run quota in us-east4

This will return the current quotas:

  Presenting Quotas Clearly
  Here are the Google Cloud Run quotas for project aisprint-491218 in the us-east4 region:

  ### 📋 Core Resource Quotas

  • Services ( ServicesPerProject ): 1,000 limit per region.
  • Jobs ( JobsPerProject ): 1,000 limit per region.
  • Worker Pools ( WorkerPoolsPerProject ): 1,000 limit per region.
  • Active Revisions ( ActiveRevisionsPerProject ): 4,000 limit.
  • Running Executions ( RunningExecutionsPerProject ): 1,000 limit.
  • Tags ( TagsPerProject ): 2,000 limit.

  ### ⚡ Compute & Storage Allocations

  • Total CPU Allocation ( CpuAllocPerProjectRegion ): 5,000,000 milli vCPUs (equivalent to 5,000 vCPUs).
  • Total Memory Allocation ( MemAllocPerProjectRegion ): 10,737,418,240,000 bytes (equivalent to 10 TiB).
  • Max Per-Instance Ephemeral Disk ( MaxPerInstanceEphemeralDiskSizePerProjectRegion ): 10,737,418,240 bytes 
  (equivalent to 10 GiB).
  • Direct VPC Egress Instance Limit ( MaxInstancesLimitWithDirectVpcEgressPerProjectRegion ): 100 instances.

  ### 🎮 GPU Quotas

  • Nvidia L4 GPUs without zonal redundancy ( NvidiaL4GpuAllocNoZonalRedundancyPerProjectRegion ): 3 GPUs.
  • Nvidia L4 GPUs with zonal redundancy ( NvidiaL4GpuAllocPerProjectRegion ): No specific limits applied.
  • Nvidia RTX Pro 6000 GPUs ( NvidiaRtxPro6000GpuAllocNoZonalRedundancyPerProjectRegion ): No specific limits applied.

  ### ⏱️ API Rate Limits (per minute)

  • Read Requests ( ReadPerMinutePerProjectRegion ): 3,000 requests/min.
  • Write Requests ( WritePerMinutePerProjectRegion ): 180 requests/min.
  • Job Run Requests ( JobRunPerMinutePerProjectRegion ): 180 requests/min.
  • Inlined Source Deployments ( InlinedSourceDeployPerMinutePerProjectRegion ): 5 requests/min.
  • Source Deploy Bytes ( SourceDeployBytesPerMinutePerProjectRegion ): 2,000,000,000 bytes (equivalent to 2 GB /
  min).

You can also use the Antigravity CLI to submit a quota increase:

> request RTX Pro 6000 GPUs cloud run quota of 3 in us-east4
● google-dev-knowledge/search_documents(Quota increase command search)
● Read(/home/xbill/.gemini/antigravity-cli/brain/5a6...4e9cc0a/.system_generated/steps/32/output.txt) (ctrl+o to expand)

▸ Thought for 1s, 511 tokens
  Clarifying The Scope

● Search(Searching for L4 GPU metric)
● Read(/home/xbill/.gemini/antigravity-cli/brain/5a65b024-1b52-4d0b-b816-652984e9cc0a/scratch/run_quotas.yaml)
● Search(Grep search for RTX Pro 6000)
● Read(/home/xbill/.gemini/antigravity-cli/brain/5a65b024-1b52-4d0b-b816-652984e9cc0a/scratch/run_quotas.yaml)
● Bash(gcloud beta quotas preferences create --service=run.googleapis.com --project=aisprint-491218 --quota-
id=NvidiaRtx...)
● Bash(gcloud beta quotas preferences create --service=run.googleapis.com --project=aisprint-491218 -...) (ctrl+o to
expand)

  I have submitted requests to increase both RTX Pro 6000 GPU quotas to 3 GPUs (represented as 3000 milli GPUs in the
  API) in us-east4 for project aisprint-491218 .

  Here are the details of the submitted quota preferences:

  ### 1. With Zonal Redundancy

  • Quota ID: NvidiaRtxPro6000GpuAllocPerProjectRegion 
  • Service: run.googleapis.com 
  • Region: us-east4 
  • Requested Limit: 3000 milli GPUs (3 GPUs)
  • Preference Name: projects/aisprint-491218/locations/global/quotaPreferences/bdf815c3-a947-4069-91ae-8b6c6c0dd345 
  • Status: Submitted (reconciling)

  ### 2. Without Zonal Redundancy

  • Quota ID: NvidiaRtxPro6000GpuAllocNoZonalRedundancyPerProjectRegion 
  • Service: run.googleapis.com 
  • Region: us-east4 
  • Requested Limit: 3000 milli GPUs (3 GPUs)
  • Preference Name: projects/aisprint-491218/locations/global/quotaPreferences/2f6ccda1-76a5-4867-8f89-78948cd725b8 
  • Status: Submitted (reconciling)

The Official vLLM repo also has Gemma4 specific information:

Releases · vllm-project/vllm

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

 ### ⚙️ Active Environment Configuration

  • Project ID: aisprint-491218 
  • Region: us-east4 
  • Model Name: /mnt/models/gemma-4-E4B-it 
  • Active Serving Endpoint: Cloud Run Service https://vllm-gemma-4-e4b-it-289270257791.us-east4.run.app
  • GCS Storage Bucket: aisprint-491218-bucket 
  ──────
  ### 🧰 Summary of Available MCP Tools

  #### 🐳 Infrastructure & Deployment

  • deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA L4 in us-east4 ).
  • destroy_vllm : Deletes the Cloud Run vLLM service.
  • status_vllm : Checks the status of the Cloud Run vLLM service.
  • update_vllm_scaling : Configures min/max instances.
  • get_vllm_deployment_config / get_vllm_gpu_deployment_config : Outputs deployment configs for Cloud Run and GKE
  (NVIDIA L4).
  • check_gpu_quotas : Checks GPU resource availability and quotas.

  #### 📦 Model Management

  • list_vertex_models / list_bucket_models : Lists available model registries and GCS weight locations.
  • save_hf_token : Secures your Hugging Face API token in Secret Manager.
  • get_vertex_ai_model_copy_instructions / get_huggingface_model_copy_instructions : Guide commands to ingest
  models.

  #### 📊 Monitoring, Diagnostics & Benchmarks

  • get_system_status : Pulls high-level dashboard metrics and active settings.
  • verify_model_health : Verifies latency and connectivity.
  • run_benchmark : Performs concurrency sweeps and latency profiling.
  • analyze_cloud_logging / analyze_gpu_logs : Diagnoses SRE errors and parses system logs using the self-hosted
  model.
  • suggest_sre_remediation : Automatically generates remediation steps for detected errors.

Deploy The Model

The MCP server tools are used to deploy the model to Cloud Run with TPU:

deploy the model

  The container logs show that the vLLM engine initialized correctly and began loading the model checkpoint files from
  the GCS FUSE mount ( /mnt/models/gemma-4-E4B-it ) at 18:25:53 .

  Since the weights are read on-demand over the network via FUSE and the engine has to compile CUDA graphs, the
  initialization phase takes several minutes before the startup health probe gets a 200 OK at /health . I've set
  another 90-second status timer and will monitor its progress.

Checking System status

The status can be checked with an MCP tool:

> get_system_status

● gpu-devops-agent/get_system_status(System status check) (ctrl+o to expand)

  The system status is fully online:

  ### 🌀 GPU Cloud Run System Status

  • vLLM Health: 🟢 Online (Endpoint https://vllm-gemma-4-e4b-it-289270257791.us-east4.run.app)
  • Cloud Run Service Status: 🟢 Ready

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> verify_model_health

● gpu-devops-agent/verify_model_health(Model health check) (ctrl+o to expand)

  The model health check has successfully passed:

  ### 🏥 Model Health Verification

  • Status: ✅ PASSED
  • Model Path: /mnt/models/gemma-4-E4B-it 
  • Latency: 1.35 seconds 
  • Response Snippet: "Hello! Yes, I am working. I am Gemma 4, a large la..."

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

and model details:

> get_active_models

● gpu-devops-agent/get_model_details(Model details retrieval) (ctrl+o to expand)

  The active model details from the running vLLM service are as follows:

  ### 🧩 Active Model details

  • Endpoint URL: https://vllm-gemma-4-e4b-it-289270257791.us-east4.run.app 
  • Active Model ID: /mnt/models/gemma-4-E4B-it 
  • Health Status: Healthy ✅

and the endpoint:

> get_endpoint

● gpu-devops-agent/get_endpoint(Endpoint retrieval) (ctrl+o to expand)

  The active endpoint URL for the Cloud Run vLLM deployment is:

  • Endpoint URL: https://vllm-gemma-4-e4b-it-289270257791.us-east4.run.app (🟢 Online)

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

and status:

> get_system_details

● Bash(gcloud run services describe vllm-gemma-4-e4b-it --region us-east4 --project aisprint-491218 -...) (ctrl+o to
expand)

  Here are the detailed configuration and system specs of the active Cloud Run vLLM deployment:

  ### ⚙️ Hardware & Compute Resources

  • Accelerator: NVIDIA L4 GPU (1 unit)
  • CPU Limit: 8 vCPUs
  • Memory Limit: 32 GiB 
  • Execution Environment: Gen2
  • Storage Mount: aisprint-491218-bucket (GCS Bucket mounted at /mnt/models via GCS Fused volume driver, read-
  only)

  ### 🚀 vLLM Engine Configuration

  • Container Image: vllm/vllm-openai:latest 
  • Entrypoint: python3 -m vllm.entrypoints.openai.api_server 
  • Startup Arguments:
      • --model=/mnt/models/gemma-4-E4B-it 
      • --dtype=bfloat16 
      • --max-model-len=16384 
      • --gpu-memory-utilization=0.95 
      • --kv-cache-dtype=fp8 
      • --limit-mm-per-prompt={} 
      • --tensor-parallel-size=1 
      • --max-num-seqs=8 
      • --max-num-batched-tokens=4096 
      • --enable-auto-tool-choice 
      • --tool-call-parser=gemma4 
      • --reasoning-parser=gemma4 

  ### 📈 Network & Autoscaling

  • Container Concurrency: 4 simultaneous requests
  • Autoscaling Target: Max Revision Scale = 1 
  • Startup Probe: HTTP GET /health on port 8000 (Initial delay: 180s , Timeout: 60s )

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

Check the Cloud Run deployment status:

> show the cloud run docker status

● Bash(gcloud run services describe vllm-gemma-4-e4b-it --region us-east4 --project aisprint-491218 -...) (ctrl+o to
expand)

  In Cloud Run, the "docker container status" is tracked via its Service Conditions and Revisions.

  Here is the status of the container deployment:

  ### 📦 Container Service Status

  • Service URL: https://vllm-gemma-4-e4b-it-wgcq55zbfq-uk.a.run.app 
  • Active Container Image: vllm/vllm-openai:latest 
  • Latest Revision: vllm-gemma-4-e4b-it-00023-n4c (healthy and serving 100% of traffic)

  ### 🏥 System Status Conditions

   Condition Type | Status | Last Transition Time | Description
  -----------------------|-----------|------------------------|--------------------------------------------------------
    Ready | 🟢 True | 2026-05-28T18:32:41Z | The container is healthy, initialized, and accepting
                         | | | incoming requests.
    ConfigurationsReady | 🟢 True | 2026-05-28T18:23:35Z | Service parameters, environment variables, volumes,
                         | | | and hardware specs are valid.
    RoutesReady | 🟢 True | 2026-05-28T18:32:41Z | Traffic routing configuration and network load
                         | | | balancing are ready.

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>

Check the Cloud Run Service

Once Deployed — the model is visible as a standard Cloud Run Service:

Review the Model

Check the model details:

> what are the model details?

● gpu-devops-agent/get_model_details(Model details retrieval) (ctrl+o to expand)

  The deployed model details are as follows:

  ### 🧩 Active Serving Model

  • Model ID / Name: /mnt/models/gemma-4-E4B-it 
  • Object Type: model (OpenAI-compatible /v1/models format)
  • Serving Platform: vLLM 
  • Health Status: Healthy ✅

  ### ℹ️ About Gemma 4 (E4B-it)

  • Model Family: Gemma 4 (Google's lightweight, state-of-the-art open models built from the same research and
  technology used to create the Gemini models).
  • Variant: E4B-it (Experimental ~4 Billion parameter Instruction-Tuned model).
  • Optimization: Configured to run in bfloat16 precision with FP8 KV caching to optimize memory usage and processing
  speed on the NVIDIA L4 GPU.
  • Context Window: Configured for up to 16,384 tokens.

Once everything is online — you can send queries to the Model:

> query_gemma4_with_stats what is gemma4?

● gpu-devops-agent/query_gemma4_with_stats(Query Gemma 4) (ctrl+o to expand)

  Here is the response from the model, along with performance statistics:

  ### 💬 Model Response

  │ I am Gemma 4. I am a Large Language Model developed by Google DeepMind. I am an open weights model.

  ### 📊 Serving Performance Stats

  • Model ID: /mnt/models/gemma-4-E4B-it 
  • Time to First Token (TTFT): 0.094s (Excellent startup latency)
  • Tokens per Second: 25.63 tokens/s 
  • Total Generation Time: 1.07s

Benchmark the Local Model

The local model was benchmarked:

 The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

  ### 📈 Concurrency & Context Window Benchmark Sweep (Gemma 4 E4B-it)

   Context Window | Concurrency | Success Rate | Avg Latency (s) | Throughput (tok/s)
  -------------------|------------------------------------|-------------------|------------------|---------------------
   8 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.69 - 1.88s | 18.90 - 89.64 tok/s
   16 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.68 - 1.84s | 23.34 - 89.40 tok/s
   32 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.69 - 1.85s | 22.99 - 88.48 tok/s
   64 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.15 - 0.49s | 13.48 - 42.61 tok/s
   128 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.14 - 0.50s | 13.69 - 44.76 tok/s
   256 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.14 - 0.51s | 13.62 - 43.54 tok/s
   512 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.15 - 0.51s | 13.03 - 42.63 tok/s
   1024 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.16 - 0.51s | 12.22 - 41.20 tok/s
   2048 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.18 - 0.51s | 10.96 - 42.17 tok/s
   4096 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.21 - 0.56s | 17.45 - 39.08 tok/s
   8192 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.23 - 0.66s | 17.21 - 34.89 tok/s
   16384 | 1 / 2 / 4 / 8 / 16 / 32 / 64 / 128 | 100.0% | 0.27 - 0.81s | 14.18 - 27.59 tok/s

  ### 🔍 Key Findings

  • 100% Success Rate: Every request across all context window tests completed successfully without a single OOM or
  timeout.
  • Throughput Scaling: High concurrency configurations (16 to 128 concurrent users) hit optimal throughput peaks of 80
  to 90 tok/s for smaller context sizes, and remained highly stable around 25 to 27 tok/s even at the maximum 16,384
  context window.

The analysis is reviewed:

## 🔍 Key Performance Insights

### 1. Throughput Scaling (Left Chart)
* **High Efficiency at Low Concurrency** : With single-user requests (concurrency = 1), throughput is highly consistent acr
oss smaller context sizes.
* **Concurrency Scaling** : As concurrency climbs from 1 to 16, throughput scales up significantly, peaking around **80 to 
90 tokens/second** at smaller context windows. This proves the efficacy of vLLM's continuous batching implementation.
* **Large Context Handling** : At the maximum context window size of **16,384 tokens** , throughput remains extremely stable
 around **25 to 27 tokens/second** , demonstrating that the GPU is fully capable of serving long-context requests under hig
h loads without degradation.

### 2. Latency Characteristics (Right Chart)
* **Sub-Second Response Times** : Across all tested context windows (from 8 to 16,384 tokens), average latency remains well
 below 1 second for standard workloads (concurrency levels 1 to 32).
* **Predictable Scaling** : Even at maximum concurrency (128 concurrent users) and largest context sizes, latency stays und
er **1 second** on average, making this stack highly suitable for real-time applications and SRE automated diagnostic work
flows.

This is visualized:

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot Cloud Run Model deployments.

DEV Community