This article provides a step by step deployment guide for Gemma 4 to a Google Cloud Run hosted GPU enabled system,. A suite of Python MCP tools is built to simplify management of the vLLM hosted Gemma 4 deployment with Antigravity CLI.
What is this project trying to Do?
This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.
This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:
Gemma-SRE: Self-Hosted vLLM Infrastructure Agent
Antigravity CLI
Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.
Full details on installing Antigravity CLI are here:
Getting Started with Antigravity CLI
Testing the Antigravity CLI Environment
Once you have all the tools in place- you can test the startup of Antigravity CLI.
You will need to authenticate with a Google Cloud Project or your Google Account:
agy
This will start the interface:
Full Installation Instructions
The detailed installation instructions for Antigravity CLI are here:
Getting Started with Antigravity CLI
Python MCP Documentation
The official GitHub Repo provides samples and documentation for getting started:
Where do I start?
The strategy for starting MCP development for model management is a incremental step by step approach.
First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.
Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.
This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.
Setup the Basic Environment
At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:
cd ~
git clone https://github.com/xbill9/gemma4-tips
Then run init.sh from the cloned directory.
The script will attempt to determine your shell environment and set the correct variables:
gpu-31B-6000-devops-agent
source init.sh
If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:
gpu-31B-6000-devops-agent
source set_env.sh
Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.
Model Management Tool with MCP Stdio Transport
One of the key features that the standard MCP libraries provide is abstracting various transport methods.
The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.
The simplest transport that the SDK supports is the stdio (stdio/stdout) transportβββwhich connects a locally running process. Both the MCP client and MCP Server must be running in the same environment.
The connection over stdio will look similar to this:
# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")
Running the Python Code
First- switch the directory with the Python version of the MCP sample code:
~/gemma4-tips/gpu-31B-6000-devops-agent
Run the release version on the local system:
make install
Processing ./.
The project can also be linted:
xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files
And a test run:
xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$ make test
python test_agent.py
2026-05-29 16:36:41,800 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.027s
OK
xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$
MCP stdio Transport
One of the key features that the MCP protocol provides is abstracting various transport methods.
The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.
The simplest transport that the SDK supports is the stdio (stdio/stdout) transportβββwhich connects a locally running process. Both the MCP client and MCP Server must be running in the same environment.
In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.
Antigravity CLI mcp_config.json
A sample MCP server file is provided in the .agents directory:
{
"mcpServers": {
"gpu-31b-6000-devops-agent": {
"command": "python3",
"args": [
"/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py"
],
"env": {
"GOOGLE_CLOUD_PROJECT": "aisprint-491218",
"GOOGLE_CLOUD_LOCATION": "us-central1",
"VLLM_BASE_URL": "https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app",
"MODEL_NAME": "/mnt/models/gemma-4-31B-it"
}
}
}
}
Validation with Antigravity CLI
The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:
MCP Servers
Plugins (~/.gemini/antigravity-cli/plugins)
> β google-dev-knowledge Tools: search_documents, answer_query, get_documents
β gpu-31b-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
analyze_cloud_logging, +19 more
Getting Started with Gemma 4 on GPU
The Official vLLM repo also has Gemma4 specific information:
Model Lifecycle Management via MCP
The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.
Overview of MCP tools :
> list mcp tools
β gpu-31b-6000-devops-agent/get_help(Get help details from agent) (ctrl+o to expand)
Here is the list of available Model Context Protocol (MCP) tools provided by the gpu-31b-6000-devops-agent server:
### π³ Infrastructure & Deployment
β’ deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
β’ destroy_vllm : Deletes the Cloud Run vLLM service.
β’ status_vllm : Checks the status of the Cloud Run vLLM service.
β’ update_vllm_scaling : Updates min/max instances for scaling.
β’ get_vllm_deployment_config : Generates the gcloud deployment command.
β’ get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
β’ check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.
### π¦ Model Management
β’ list_vertex_models : Lists models in the Vertex AI Registry.
β’ list_bucket_models : Lists model weights in the GCS bucket.
β’ save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
β’ get_vertex_ai_model_copy_instructions : Returns instructions to copy a model from Vertex AI Model Garden to GCS.
β’ get_huggingface_model_copy_instructions : Returns instructions to download a model from Hugging Face and upload it
to GCS.
β’ get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .
### π Monitoring & Status
β’ get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
β’ get_endpoint : Verifies connectivity and returns the active service URL.
β’ get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
β’ verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.
### π Performance & Benchmarking
β’ run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.
### π¬ Interaction & Diagnostics
β’ query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
β’ query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
β’ query_vllm : Direct text completions querying tool.
β’ analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
β’ analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
β’ suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
β’ get_help : Retrieves the help menu, active environment configuration, and tool mappings.
Deploy The Model
The MCP server tools are used to deploy the model to Cloud Run with TPU:
The compilation step and warm-up run have successfully completed:
β’ torch.compile took 67.21 seconds.
β’ FlashInfer JIT autotuning completed.
β’ CUDA graph memory profiling is complete (5/5 CUDA graphs captured).
The engine is now starting up the OpenAI API router. Port 8080 will be open and the revision will be marked active
within seconds.
β Bash(make query PROMPT="Hello Gemma 4!") (ctrl+o to expand)
I have initiated a test query ( make query ) against the new endpoint to verify the vLLM serving stack is fully
functional and responding to prompts. I'll notify you as soon as the response arrives.
The deployment is fully complete and functional!
Here is the final summary:
β’ Model Copy: All weights and tokenizer configurations for google/gemma-4-31B-it (totaling 58.28 GiB) were
successfully transferred from Hugging Face directly to your Google Cloud Storage bucket ( gs://aisprint-491218-
bucket/gemma-4-31B-it/ ).
β’ Deployment: The Cloud Run GPU service was successfully deployed under the name gpu-31b-6000-devops-agent pointing
to the newly uploaded 31B model folder.
β’ Initialization: The vLLM serving stack initialized, loaded the 31B model using the Model Streamer, successfully
performed torch.compile graph optimization, and compiled CUDA graphs.
β’ Verification: A test query was sent to the deployed Cloud Run endpoint and returned a successful completions
response structure showing full connectivity and model execution.
The endpoint is live and serving traffic at:
https://gpu-31b-6000-devops-agent-289270257791.us-central1.run.app
Checking System status
The status can be checked with an MCP tool:
> get_system_status
β gpu-31b-6000-devops-agent/get_system_status(Retrieve system status details) (ctrl+o to expand)
The system status has been successfully retrieved:
### π GPU Cloud Run System Status
β’ vLLM Health: π’ Online (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)
β’ Cloud Run Service Status: π’ Ready
Next Step: You can use the query_gemma4 tool to interact with the model.
Cross Check The Deployed Model
Once the model starts locally- the MCP tool allows for model verification:
> verify_model_health
β gpu-31b-6000-devops-agent/verify_model_health(Verify model health status) (ctrl+o to expand)
The model health check has passed successfully:
β’ Status: β
PASSED
β’ Model: gpu-31b-6000-devops-agent
β’ Response: "Yes, I am working! How can I help you today?..."
β’ Latency: 0.44 seconds
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
>
and model details:
> get model details
β gpu-31b-6000-devops-agent/get_model_details(Retrieve details about the serving model) (ctrl+o to expand)
The model details have been retrieved successfully:
### π§© Model Details (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)
Model Information ( /v1/models ):
[
{
"id": "gpu-31b-6000-devops-agent",
"object": "model",
"owned_by": "vllm"
}
]
Health Status ( /health ):
β’ Status: Healthy β
and the endpoint:
> get endpoint
β gpu-31b-6000-devops-agent/get_endpoint(Retrieve the endpoint URL of the serving stack) (ctrl+o to expand)
The Cloud Run vLLM endpoint is online at:
π’ Endpoint URL: https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app
Check the Cloud Run deployment status:
> show the cloud run docker status
β gpu-31b-6000-devops-agent/status_vllm(Get status of vLLM Cloud Run service)
β Search(Search for status_vllm)
β Read(/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py)
β Bash(gcloud run services describe gpu-31b-6000-devops-agent --project=aisprint-491218 --region=us-c...) (ctrl+o to
expand)
The container (Docker) configuration and status for the Cloud Run service gpu-31b-6000-devops-agent are:
β’ Docker Image: vllm/vllm-openai:latest
β’ Port: 8080 (HTTP)
β’ Resource Limits:
β’ CPU: 20 Cores
β’ Memory: 80 GiB
β’ GPU: 1 NVIDIA GPU (RTX Pro 6000)
β’ Volume Mounts: /mnt/models mounted to model-volume
β’ Startup Probe: Starts checking after 240 seconds, with a 15-second period and 120 failure threshold (designed to
accommodate large model loading time).
β’ Entrypoint / Args:
bash -c "vllm serve /mnt/models/gemma-4-31B-it \
--served-model-name gpu-31b-6000-devops-agent \
--enable-log-requests \
--enable-chunked-prefill \
--enable-prefix-caching \
--generation-config auto \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--dtype bfloat16 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--max-num-seqs 8 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 1 \
--load-format runai_streamer \
--port 8080 \
--host 0.0.0.0 \
--max-model-len 32767"
Check the Cloud Run Service
Once Deployedβββthe model is visible as a standard Cloud Run Service:
Review the Model
Check the model details:
> what are the model details?
β gpu-31b-6000-devops-agent/get_model_details(Retrieve details about the serving model) (ctrl+o to expand)
The credentials have been refreshed successfully! Here are the model details:
### π§© Model Details (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)
Model Information ( /v1/models ):
[
{
"id": "gpu-31b-6000-devops-agent",
"object": "model",
"owned_by": "vllm"
}
]
Health Status ( /health ):
β’ Status: Healthy β
Once everything is onlineβββyou can send queries to the Model:
> query_gemma4_with_stats what is gemma?
β gpu-31b-6000-devops-agent/query_gemma4_with_stats(Query model for Gemma info and return statistics) (ctrl+o to expand)
Here are the performance statistics and the model's response:
### π Performance Stats
β’ Model: gpu-31b-6000-devops-agent
β’ Time to First Token (TTFT): 0.093s
β’ Total Generation Time: 15.658s
β’ Tokens per Second: 39.13 tokens/s
β’ Total Tokens (approx.): 609
### π¬ Model Response
Gemma is a family of lightweight, state-of-the-art open-weights large language models (LLMs) developed by Google
DeepMind.
Think of Gemma as the "smaller, open sibling" of Gemini, Google's most powerful AI. While Gemini is a closed system
available via API or chat interface, Gemma is designed to be downloaded and run on your own hardware.
Here is a detailed breakdown of what makes Gemma unique:
### 1. "Open-Weights" vs. "Open Source"
It is important to note that Gemma is open-weights, not fully open-source.
β’ Open-Source usually means the entire training data and code are public.
β’ Open-Weights means Google provides the "brain" (the trained parameters) for free, allowing developers to run it
locally and fine-tune it for specific tasks, but the exact training data remains proprietary.
### 2. Relationship to Gemini
Gemma was built using the same research, technology, and infrastructure used to create the Gemini models. This means
it inherits much of the reasoning, coding, and mathematics capability of the larger Gemini models, but in a package
that is small enough to fit on a personal laptop or a workstation.
### 3. Key Versions
Gemma typically comes in different sizes to suit different hardware:
β’ Gemma 2B: Extremely small; can run on a mobile phone or a basic laptop.
β’ Gemma 9B / 27B: Mid-sized models that offer a balance between high intelligence and performance. The 27B version is
designed to compete with much larger models (like Llama 3) while remaining efficient.
### 4. What can you do with Gemma?
Because you can host Gemma on your own machine, it is popular for:
β’ Privacy: Processing sensitive data without sending it to a cloud server.
β’ Customization (Fine-Tuning): A developer can take Gemma and "teach" it a specific medical dataset or a company's
internal documentation to make it an expert in a niche field.
β’ Research: Academics use it to study how LLMs work without needing a supercomputer.
β’ Edge Computing: Running AI locally on devices that don't have a constant internet connection.
### 5. How does it compare to others?
Gemma is Google's direct competitor to Meta's Llama (e.g., Llama 3) and Mistral AI's Mistral/Mixtral models. All
three are fighting to be the "standard" model that developers use to build their own AI applications.
In short: Gemma is Google's high-performance, portable AI model that allows developers to build and run their own AI
apps privately and locally.
Benchmark the Model
The model was benchmarked:
The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
window sizes and concurrency levels:
The benchmark sweep across 1 to 256 concurrent users and 8 to 16,384 context tokens has successfully completed.
### π Generated Files
β’ Performance Report: benchmark_report.md
β’ Raw CSV Results: matrix_benchmark_results.csv
β’ Performance Chart: rtx_6000_benchmark.png
β’ Conversation Artifact: benchmark_report.md
### π Key Results Summary
β’ Continuous Batching Efficiency: The vLLM engine scales exceptionally well under load. Under maximum load (256
concurrent users with 16,384 context window), the token generation rate peaked at ~344.96 tokens/sec.
β’ Latency Profile: Latency is highly predictable, starting at sub-second/single-digit latency for lower concurrency
profiles, scaling linearly to approximately 41.43s for 256 concurrent users at 16,384 tokens.
β’ Warmup/Cold Starts: During the transition to the 512 token context sweep, a container restart and graph
compilation occurred, resulting in temporarily lower throughput before graph caching took over and returned the
system to optimal speed.
The analysis is reviewed:
### π Key Takeaways
1. Continuous Batching Efficiency ($C=16$ Sweet Spot):
Scaling concurrency from 1 to 16 increases total token throughput by almost 8x (jumping from ~38 tok/s to ~300
tok/s) while average latency only increases slightly (from 3.32s to 5.17s).
2. Throughput Saturation:
The RTX 6000 Pro's token throughput caps out at ~310β345 tokens/sec. Pushing concurrency past 32 up to 256 does
not yield further throughput gains, but instead results in linear queuing latency (latency reaches up to ~54.87s).
3. Prefix Caching Phenomenon (16,384 tokens):
Under high load, large context windows (16k tokens) actually completed faster than short context windows (avg latency
of 41.43s vs 54.87s at 256 concurrency). This is due to vLLM's automatic prefix caching which evaluates the large
prompt KV cache once and shares it, reducing prefill time to $O(1)$ for subsequent concurrent requests.
### π οΈ SRE Configuration Recommendations
β’ Cloud Run Concurrency: Limit instance concurrency to --concurrency=16 or --concurrency=32 to trigger auto-
scaling to new instances rather than queuing on a single saturated GPU.
β’ Prefix Caching: Always run with the --enable-prefix-caching flag enabled to optimize shared prompt contexts (e.g.
system instructions, repeated log dumps).
β’ Startup Delays: Set startup timeouts to at least 240 seconds to accommodate container cold-starts, safetensors
downloading, and torch.compile graph optimization.
This is visualized:
Compare to TPU Deployment
The same model was deployed on a v6e-4 TPU for comparison:
I have generated a side-by-side performance comparison between the RTX 6000 Pro GPU and Cloud TPU v6e-4.
π Detailed Comparison Report: gpu_tpu_comparison_report.md
πΌοΈ Visual Comparison Chart: gpu_tpu_comparison.png
### π Key Performance Differences
1. Peak Throughput (TPU Dominance):
β’ For short contexts (8 tokens), TPU v6e-4 reaches ~1,379 tokens/sec compared to the GPU's ~308 tokens/sec (4.5x
faster).
β’ For long contexts (16,384 tokens), TPU v6e-4 achieves ~3,206 tokens/sec compared to the GPU's ~344 tokens/sec
(9.3x faster).
2. Latency (Sub-Second Responses):
β’ TPU latency stays extremely low under medium load (averaging 0.46s at concurrency 16), whereas the GPU averages
5.17s for the same level.
3. Architectural Fit:
β’ NVIDIA RTX 6000 Pro GPU (Cloud Run): Perfect for cost-sensitive, low-to-medium concurrency apps, or workflows
with high idle periods (leveraging Cloud Run's scale-to-zero model).
β’ Cloud TPU v6e-4: Essential for high-traffic enterprise gateways, real-time streaming, and massive batch log-
parsing operations where high context and high concurrency are standard.
π Architectural Deep Dive Report: gpu_tpu_deep_dive.md
### π¬ Core Hardware Findings
1. Memory Bandwidth Bottleneck (Decode Phase):
During token generation (decode), execution is entirely memory-bandwidth bound because the model parameters must be
read from VRAM for every single token. The TPU v6e-4's HBM2e memory supplies ~6.4 TB/s of bandwidth compared to the
GPU's 960 GB/s (GDDR6). This 6.6x advantage is why the TPU handles high concurrency without saturating.
2. Distributed Processing (Tensor Parallelism):
The TPU v6e-4 config splits the 31B model across 4 chips ( --tensor-parallel-size 4 ) communicating over an ultra-low
latency Inter-Chip Interconnect (ICI). The RTX 6000 Pro runs on a single card ( --tensor-parallel-size 1 ),
processing all weights on a single GPU core.
3. Compiler Efficiency (XLA vs. CUDA):
TPUs compile the model graph using XLA (Accelerated Linear Algebra), which optimizes matrix operations and minimizes
operator execution overhead. The GPU relies on CUDA Graphs and torch.compile which, while fast, has higher startup
graph compilation times and launch latencies.
The results were graphed:
and recommendations generated:
π Recommendations Document: production_recommendations.md
### π Summary of Key Actions
1. Hardware Strategy:
β’ Use RTX 6000 Pro (Cloud Run) for SRE troubleshooting tools, development, and low-traffic applications where
scale-to-zero saves significant cost.
β’ Use Cloud TPU v6e-4 for high-volume enterprise APIs or systems handling massive context windows (16,384+
tokens) at high concurrency.
2. Cloud Run Tuning:
β’ Set --concurrency=16 or --concurrency=32 to avoid saturating a single GPU, directing excess traffic to auto-
scaled new instances.
β’ Configure --timeout=300 and TCP startup probes to at least 240 seconds to allow for weight streaming and
graph optimization.
3. Prefix Caching Alignment:
β’ Keep --enable-prefix-caching active.
β’ Organize prompts from static (system parameters, database schemas) to dynamic (user query, timestamped logs) to
maximize prefix cache hit ratios.
Finallyβββa basic cost analsys:
I have performed a cost-per-token analysis comparing the NVIDIA RTX 6000 Pro GPU (Cloud Run) to the Cloud TPU v6e-4
based on GCP's resource pricing models.
π Detailed Cost Analysis Report: cost_per_token_analysis.md
### π Cost-per-Million Tokens Comparison (at Concurrency = 16)
1. Short Context (8 Input Tokens):
β’ RTX 6000 Pro GPU: $3.49 per million tokens
β’ Cloud TPU v6e-4: $4.99 per million tokens
β’ Verdict: GPU is 30% cheaper because the machine-hour cost is lower ($3.76/hr vs $5.50/hr).
2. Long Context (16,384 Input Tokens):
β’ RTX 6000 Pro GPU: $4.24 per million tokens
β’ Cloud TPU v6e-4: $0.53 per million tokens
β’ Verdict: TPU is 8x cheaper because the TPU v6e-4 leverages HBM2e memory bandwidth to maintain massive
throughput (2,892 tokens/sec vs 246 tokens/sec on GPU).
### π‘ SRE Recommendation
β’ Spiky/Ad-Hoc Workloads (e.g. On-demand log diagnostics): Use the GPU on Cloud Run. It supports scale-to-zero,
meaning you only pay the $3.76/hr rate while processing requests, saving substantial money over a TPU VM which must
run constantly (charging $132/day even when idle).
β’ Continuous/High-Throughput Production Workloads: Use the Cloud TPU v6e-4. Running TPU instances continuously is
significantly more cost-efficient for heavy text processing and long-context analysis.
Summary
The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.
A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot Cloud Run Model deployments.





Top comments (0)