DEV Community

xbill for Google Developer Experts

Posted on • Originally published at xbill999.Medium on

31B Gemma 4 Deployment with NVIDIA Blackwell 6000, MCP, Cloud Run, and Antigravity CLI

This article provides a step by step deployment guide for Gemma 4 to a Google Cloud Run hosted GPU enabled system,. A suite of Python MCP tools is built to simplify management of the vLLM hosted Gemma 4 deployment with Antigravity CLI.

What is this project trying to Do?

This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.

This project is similar to a previous project that targeted GPU hosted Gemma4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the follow-on successor to Gemini CLI- the terminal driven, agent assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place- you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy
Enter fullscreen mode Exit fullscreen mode

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is a incremental step by step approach.

First, the basic development environment is setup with the required system variables, and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Setup the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips
Enter fullscreen mode Exit fullscreen mode

Then run init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

gpu-31B-6000-devops-agent
source init.sh
Enter fullscreen mode Exit fullscreen mode

If your session times out or you need to re-authenticate- you can run the set_env.sh script to reset your environment variables:

gpu-31B-6000-devops-agent
source set_env.sh
Enter fullscreen mode Exit fullscreen mode

Variables like PROJECT_ID need to be setup for use in the various build scripts- so the set_env script can be used to reset the environment if you time-out.

Model Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high level MCP tool implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The simplest transport that the SDK supports is the stdio (stdio/stdout) transportβ€Šβ€”β€Šwhich connects a locally running process. Both the MCP client and MCP Server must be running in the same environment.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")
Enter fullscreen mode Exit fullscreen mode

Running the Python Code

First- switch the directory with the Python version of the MCP sample code:

~/gemma4-tips/gpu-31B-6000-devops-agent
Enter fullscreen mode Exit fullscreen mode

Run the release version on the local system:

make install
Processing ./.
Enter fullscreen mode Exit fullscreen mode

The project can also be linted:

xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files
Enter fullscreen mode Exit fullscreen mode

And a test run:

xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$ make test
python test_agent.py
2026-05-29 16:36:41,800 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
.........2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Querying Cloud Run model with prompt: 'Hello...'
2026-05-29 16:36:41,850 - vllm-devops-agent - INFO - Model response: 'Response from Gemma...'
.2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Querying model with stats with prompt: 'Hello...'
2026-05-29 16:36:41,851 - vllm-devops-agent - INFO - Model response with stats: TTFT=0.000s, TotalTime=0.000s
.......
----------------------------------------------------------------------
Ran 17 tests in 0.027s

OK
xbill@penguin:~/gemma4-tips/gpu-31B-6000-devops-agent$ 
Enter fullscreen mode Exit fullscreen mode

MCP stdio Transport

One of the key features that the MCP protocol provides is abstracting various transport methods.

The high level tool MCP implementation is the same no matter what low level transport channel/method that the MCP Client uses to connect to a MCP Server.

The simplest transport that the SDK supports is the stdio (stdio/stdout) transportβ€Šβ€”β€Šwhich connects a locally running process. Both the MCP client and MCP Server must be running in the same environment.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server file is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-31b-6000-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-central1",
        "VLLM_BASE_URL": "https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-31B-it"
      }
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as a MCP client with the Python code providing the MCP server:

MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> βœ“ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   βœ“ gpu-31b-6000-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                                analyze_cloud_logging, +19 more
Enter fullscreen mode Exit fullscreen mode

Getting Started with Gemma 4 on GPU

The Official vLLM repo also has Gemma4 specific information:

Releases Β· vllm-project/vllm

Model Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployment on Cloud Run or a TPU.

Overview of MCP tools :

> list mcp tools

● gpu-31b-6000-devops-agent/get_help(Get help details from agent) (ctrl+o to expand)

  Here is the list of available Model Context Protocol (MCP) tools provided by the gpu-31b-6000-devops-agent server:

  ### 🐳 Infrastructure & Deployment

  β€’ deploy_vllm : Deploys vLLM to Cloud Run GPU (NVIDIA RTX PRO 6000 in us-central1 ).
  β€’ destroy_vllm : Deletes the Cloud Run vLLM service.
  β€’ status_vllm : Checks the status of the Cloud Run vLLM service.
  β€’ update_vllm_scaling : Updates min/max instances for scaling.
  β€’ get_vllm_deployment_config : Generates the gcloud deployment command.
  β€’ get_vllm_gpu_deployment_config : Generates a GKE manifest for GPU (NVIDIA L4).
  β€’ check_gpu_quotas : Checks RTX 6000 and other GPU quotas for a region.

  ### πŸ“¦ Model Management

  β€’ list_vertex_models : Lists models in the Vertex AI Registry.
  β€’ list_bucket_models : Lists model weights in the GCS bucket.
  β€’ save_hf_token : Securely saves a Hugging Face API token to Secret Manager.
  β€’ get_vertex_ai_model_copy_instructions : Returns instructions to copy a model from Vertex AI Model Garden to GCS.
  β€’ get_huggingface_model_copy_instructions : Returns instructions to download a model from Hugging Face and upload it
  to GCS.
  β€’ get_huggingfacehub_download_path : Resolves the local cache path using huggingface_hub .

  ### πŸ“Š Monitoring & Status

  β€’ get_system_status : Provides a high-level status dashboard of the Cloud Run service and health.
  β€’ get_endpoint : Verifies connectivity and returns the active service URL.
  β€’ get_model_details : Retrieves detailed model metadata and engine state from /v1/models .
  β€’ verify_model_health : Deep health check by querying the model with a simple prompt and measuring latency.

  ### πŸ“ˆ Performance & Benchmarking

  β€’ run_benchmark : Runs performance/concurrency benchmark sweeps against the Cloud Run vLLM GPU endpoint.

  ### πŸ’¬ Interaction & Diagnostics

  β€’ query_gemma4 : Primary tool to query the self-hosted model with standard chat message format.
  β€’ query_gemma4_with_stats : Queries the model and returns streaming performance statistics (TTFT, throughput).
  β€’ query_vllm : Direct text completions querying tool.
  β€’ analyze_cloud_logging : Fetches logs from GCP Logging and analyzes them using the model.
  β€’ analyze_gpu_logs : Fetches Cloud Run logs and uses Gemma 4 to analyze them for SRE/DevOps errors.
  β€’ suggest_sre_remediation : Suggests remediation plans for SRE errors using the model.
  β€’ get_help : Retrieves the help menu, active environment configuration, and tool mappings.

Enter fullscreen mode Exit fullscreen mode

Deploy The Model

The MCP server tools are used to deploy the model to Cloud Run with TPU:

  The compilation step and warm-up run have successfully completed:

  β€’ torch.compile took 67.21 seconds.
  β€’ FlashInfer JIT autotuning completed.
  β€’ CUDA graph memory profiling is complete (5/5 CUDA graphs captured).

  The engine is now starting up the OpenAI API router. Port 8080 will be open and the revision will be marked active
  within seconds.

● Bash(make query PROMPT="Hello Gemma 4!") (ctrl+o to expand)

  I have initiated a test query ( make query ) against the new endpoint to verify the vLLM serving stack is fully
  functional and responding to prompts. I'll notify you as soon as the response arrives.

  The deployment is fully complete and functional!

  Here is the final summary:

  β€’ Model Copy: All weights and tokenizer configurations for google/gemma-4-31B-it (totaling 58.28 GiB) were
  successfully transferred from Hugging Face directly to your Google Cloud Storage bucket ( gs://aisprint-491218-      
  bucket/gemma-4-31B-it/ ).
  β€’ Deployment: The Cloud Run GPU service was successfully deployed under the name gpu-31b-6000-devops-agent pointing
  to the newly uploaded 31B model folder.
  β€’ Initialization: The vLLM serving stack initialized, loaded the 31B model using the Model Streamer, successfully
  performed torch.compile graph optimization, and compiled CUDA graphs.
  β€’ Verification: A test query was sent to the deployed Cloud Run endpoint and returned a successful completions
  response structure showing full connectivity and model execution.

  The endpoint is live and serving traffic at:
   https://gpu-31b-6000-devops-agent-289270257791.us-central1.run.app 
Enter fullscreen mode Exit fullscreen mode

Checking System status

The status can be checked with an MCP tool:

> get_system_status

● gpu-31b-6000-devops-agent/get_system_status(Retrieve system status details) (ctrl+o to expand)

  The system status has been successfully retrieved:

  ### πŸŒ€ GPU Cloud Run System Status

  β€’ vLLM Health: 🟒 Online (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)
  β€’ Cloud Run Service Status: 🟒 Ready

  Next Step: You can use the query_gemma4 tool to interact with the model.
Enter fullscreen mode Exit fullscreen mode

Cross Check The Deployed Model

Once the model starts locally- the MCP tool allows for model verification:

> verify_model_health

● gpu-31b-6000-devops-agent/verify_model_health(Verify model health status) (ctrl+o to expand)

  The model health check has passed successfully:

  β€’ Status: βœ… PASSED
  β€’ Model: gpu-31b-6000-devops-agent 
  β€’ Response: "Yes, I am working! How can I help you today?..."
  β€’ Latency: 0.44 seconds

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>
Enter fullscreen mode Exit fullscreen mode

and model details:

> get model details

● gpu-31b-6000-devops-agent/get_model_details(Retrieve details about the serving model) (ctrl+o to expand)

  The model details have been retrieved successfully:

  ### 🧩 Model Details (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)

  Model Information ( /v1/models ):

    [
      {
        "id": "gpu-31b-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

  Health Status ( /health ):

  β€’ Status: Healthy βœ…
Enter fullscreen mode Exit fullscreen mode

and the endpoint:

> get endpoint

● gpu-31b-6000-devops-agent/get_endpoint(Retrieve the endpoint URL of the serving stack) (ctrl+o to expand)

  The Cloud Run vLLM endpoint is online at:

  🟒 Endpoint URL: https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app
Enter fullscreen mode Exit fullscreen mode

Check the Cloud Run deployment status:

> show the cloud run docker status

● gpu-31b-6000-devops-agent/status_vllm(Get status of vLLM Cloud Run service)
● Search(Search for status_vllm)
● Read(/home/xbill/gemma4-tips/gpu-31B-6000-devops-agent/server.py)
● Bash(gcloud run services describe gpu-31b-6000-devops-agent --project=aisprint-491218 --region=us-c...) (ctrl+o to
expand)

  The container (Docker) configuration and status for the Cloud Run service gpu-31b-6000-devops-agent are:

  β€’ Docker Image: vllm/vllm-openai:latest 
  β€’ Port: 8080 (HTTP)
  β€’ Resource Limits:
      β€’ CPU: 20 Cores
      β€’ Memory: 80 GiB
      β€’ GPU: 1 NVIDIA GPU (RTX Pro 6000)
  β€’ Volume Mounts: /mnt/models mounted to model-volume 
  β€’ Startup Probe: Starts checking after 240 seconds, with a 15-second period and 120 failure threshold (designed to
  accommodate large model loading time).
  β€’ Entrypoint / Args:
    bash -c "vllm serve /mnt/models/gemma-4-31B-it \
      --served-model-name gpu-31b-6000-devops-agent \
      --enable-log-requests \
      --enable-chunked-prefill \
      --enable-prefix-caching \
      --generation-config auto \
      --enable-auto-tool-choice \
      --tool-call-parser gemma4 \
      --reasoning-parser gemma4 \
      --dtype bfloat16 \
      --quantization fp8 \
      --kv-cache-dtype fp8 \
      --max-num-seqs 8 \
      --gpu-memory-utilization 0.95 \
      --tensor-parallel-size 1 \
      --load-format runai_streamer \
      --port 8080 \
      --host 0.0.0.0 \
      --max-model-len 32767"

Enter fullscreen mode Exit fullscreen mode

Check the Cloud Run Service

Once Deployedβ€Šβ€”β€Šthe model is visible as a standard Cloud Run Service:

Review the Model

Check the model details:

> what are the model details?

● gpu-31b-6000-devops-agent/get_model_details(Retrieve details about the serving model) (ctrl+o to expand)

  The credentials have been refreshed successfully! Here are the model details:

  ### 🧩 Model Details (vLLM Endpoint https://gpu-31B-6000-devops-agent-289270257791.us-central1.run.app)

  Model Information ( /v1/models ):

    [
      {
        "id": "gpu-31b-6000-devops-agent",
        "object": "model",
        "owned_by": "vllm"
      }
    ]

  Health Status ( /health ):

  β€’ Status: Healthy βœ…
Enter fullscreen mode Exit fullscreen mode

Once everything is onlineβ€Šβ€”β€Šyou can send queries to the Model:

> query_gemma4_with_stats what is gemma?

● gpu-31b-6000-devops-agent/query_gemma4_with_stats(Query model for Gemma info and return statistics) (ctrl+o to expand)

  Here are the performance statistics and the model's response:

  ### πŸ“Š Performance Stats

  β€’ Model: gpu-31b-6000-devops-agent 
  β€’ Time to First Token (TTFT): 0.093s 
  β€’ Total Generation Time: 15.658s 
  β€’ Tokens per Second: 39.13 tokens/s 
  β€’ Total Tokens (approx.): 609 

  ### πŸ’¬ Model Response

  Gemma is a family of lightweight, state-of-the-art open-weights large language models (LLMs) developed by Google     
  DeepMind.

  Think of Gemma as the "smaller, open sibling" of Gemini, Google's most powerful AI. While Gemini is a closed system
  available via API or chat interface, Gemma is designed to be downloaded and run on your own hardware.

  Here is a detailed breakdown of what makes Gemma unique:

  ### 1. "Open-Weights" vs. "Open Source"

  It is important to note that Gemma is open-weights, not fully open-source.

  β€’ Open-Source usually means the entire training data and code are public.
  β€’ Open-Weights means Google provides the "brain" (the trained parameters) for free, allowing developers to run it
  locally and fine-tune it for specific tasks, but the exact training data remains proprietary.

  ### 2. Relationship to Gemini

  Gemma was built using the same research, technology, and infrastructure used to create the Gemini models. This means
  it inherits much of the reasoning, coding, and mathematics capability of the larger Gemini models, but in a package
  that is small enough to fit on a personal laptop or a workstation.

  ### 3. Key Versions

  Gemma typically comes in different sizes to suit different hardware:

  β€’ Gemma 2B: Extremely small; can run on a mobile phone or a basic laptop.
  β€’ Gemma 9B / 27B: Mid-sized models that offer a balance between high intelligence and performance. The 27B version is
  designed to compete with much larger models (like Llama 3) while remaining efficient.

  ### 4. What can you do with Gemma?

  Because you can host Gemma on your own machine, it is popular for:

  β€’ Privacy: Processing sensitive data without sending it to a cloud server.
  β€’ Customization (Fine-Tuning): A developer can take Gemma and "teach" it a specific medical dataset or a company's
  internal documentation to make it an expert in a niche field.
  β€’ Research: Academics use it to study how LLMs work without needing a supercomputer.
  β€’ Edge Computing: Running AI locally on devices that don't have a constant internet connection.

  ### 5. How does it compare to others?

  Gemma is Google's direct competitor to Meta's Llama (e.g., Llama 3) and Mistral AI's Mistral/Mixtral models. All
  three are fighting to be the "standard" model that developers use to build their own AI applications.

  In short: Gemma is Google's high-performance, portable AI model that allows developers to build and run their own AI 
  apps privately and locally.

Enter fullscreen mode Exit fullscreen mode

Benchmark the Model

The model was benchmarked:

The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

  The benchmark sweep across 1 to 256 concurrent users and 8 to 16,384 context tokens has successfully completed.

  ### πŸ“‚ Generated Files

  β€’ Performance Report: benchmark_report.md
  β€’ Raw CSV Results: matrix_benchmark_results.csv
  β€’ Performance Chart: rtx_6000_benchmark.png
  β€’ Conversation Artifact: benchmark_report.md

  ### πŸ“ˆ Key Results Summary

  β€’ Continuous Batching Efficiency: The vLLM engine scales exceptionally well under load. Under maximum load (256      
  concurrent users with 16,384 context window), the token generation rate peaked at ~344.96 tokens/sec.
  β€’ Latency Profile: Latency is highly predictable, starting at sub-second/single-digit latency for lower concurrency
  profiles, scaling linearly to approximately 41.43s for 256 concurrent users at 16,384 tokens.
  β€’ Warmup/Cold Starts: During the transition to the 512 token context sweep, a container restart and graph
  compilation occurred, resulting in temporarily lower throughput before graph caching took over and returned the
  system to optimal speed.
Enter fullscreen mode Exit fullscreen mode

The analysis is reviewed:

  ### πŸ”‘ Key Takeaways

  1. Continuous Batching Efficiency ($C=16$ Sweet Spot):
  Scaling concurrency from 1 to 16 increases total token throughput by almost 8x (jumping from ~38 tok/s to ~300
  tok/s) while average latency only increases slightly (from 3.32s to 5.17s).
  2. Throughput Saturation:
  The RTX 6000 Pro's token throughput caps out at ~310–345 tokens/sec. Pushing concurrency past 32 up to 256 does
  not yield further throughput gains, but instead results in linear queuing latency (latency reaches up to ~54.87s).
  3. Prefix Caching Phenomenon (16,384 tokens):
  Under high load, large context windows (16k tokens) actually completed faster than short context windows (avg latency
  of 41.43s vs 54.87s at 256 concurrency). This is due to vLLM's automatic prefix caching which evaluates the large
  prompt KV cache once and shares it, reducing prefill time to $O(1)$ for subsequent concurrent requests.

  ### πŸ› οΈ SRE Configuration Recommendations

  β€’ Cloud Run Concurrency: Limit instance concurrency to --concurrency=16 or --concurrency=32 to trigger auto-
  scaling to new instances rather than queuing on a single saturated GPU.
  β€’ Prefix Caching: Always run with the --enable-prefix-caching flag enabled to optimize shared prompt contexts (e.g.
  system instructions, repeated log dumps).
  β€’ Startup Delays: Set startup timeouts to at least 240 seconds to accommodate container cold-starts, safetensors
  downloading, and torch.compile graph optimization.
Enter fullscreen mode Exit fullscreen mode

This is visualized:

Compare to TPU Deployment

The same model was deployed on a v6e-4 TPU for comparison:

  I have generated a side-by-side performance comparison between the RTX 6000 Pro GPU and Cloud TPU v6e-4.

  πŸ“‚ Detailed Comparison Report: gpu_tpu_comparison_report.md
  πŸ–ΌοΈ Visual Comparison Chart: gpu_tpu_comparison.png

  ### πŸ“Š Key Performance Differences

  1. Peak Throughput (TPU Dominance):
      β€’ For short contexts (8 tokens), TPU v6e-4 reaches ~1,379 tokens/sec compared to the GPU's ~308 tokens/sec (4.5x 
      faster).
      β€’ For long contexts (16,384 tokens), TPU v6e-4 achieves ~3,206 tokens/sec compared to the GPU's ~344 tokens/sec
      (9.3x faster).
  2. Latency (Sub-Second Responses):
      β€’ TPU latency stays extremely low under medium load (averaging 0.46s at concurrency 16), whereas the GPU averages
      5.17s for the same level.
  3. Architectural Fit:
      β€’ NVIDIA RTX 6000 Pro GPU (Cloud Run): Perfect for cost-sensitive, low-to-medium concurrency apps, or workflows
      with high idle periods (leveraging Cloud Run's scale-to-zero model).
      β€’ Cloud TPU v6e-4: Essential for high-traffic enterprise gateways, real-time streaming, and massive batch log-
      parsing operations where high context and high concurrency are standard.

πŸ“‚ Architectural Deep Dive Report: gpu_tpu_deep_dive.md

  ### πŸ”¬ Core Hardware Findings

  1. Memory Bandwidth Bottleneck (Decode Phase):
  During token generation (decode), execution is entirely memory-bandwidth bound because the model parameters must be
  read from VRAM for every single token. The TPU v6e-4's HBM2e memory supplies ~6.4 TB/s of bandwidth compared to the
  GPU's 960 GB/s (GDDR6). This 6.6x advantage is why the TPU handles high concurrency without saturating.
  2. Distributed Processing (Tensor Parallelism):
  The TPU v6e-4 config splits the 31B model across 4 chips ( --tensor-parallel-size 4 ) communicating over an ultra-low
  latency Inter-Chip Interconnect (ICI). The RTX 6000 Pro runs on a single card ( --tensor-parallel-size 1 ),
  processing all weights on a single GPU core.
  3. Compiler Efficiency (XLA vs. CUDA):
  TPUs compile the model graph using XLA (Accelerated Linear Algebra), which optimizes matrix operations and minimizes
  operator execution overhead. The GPU relies on CUDA Graphs and torch.compile which, while fast, has higher startup
  graph compilation times and launch latencies.

Enter fullscreen mode Exit fullscreen mode

The results were graphed:

and recommendations generated:

  πŸ“‚ Recommendations Document: production_recommendations.md

  ### πŸš€ Summary of Key Actions

  1. Hardware Strategy:
      β€’ Use RTX 6000 Pro (Cloud Run) for SRE troubleshooting tools, development, and low-traffic applications where
      scale-to-zero saves significant cost.
      β€’ Use Cloud TPU v6e-4 for high-volume enterprise APIs or systems handling massive context windows (16,384+
      tokens) at high concurrency.
  2. Cloud Run Tuning:
      β€’ Set --concurrency=16 or --concurrency=32 to avoid saturating a single GPU, directing excess traffic to auto-
      scaled new instances.
      β€’ Configure --timeout=300 and TCP startup probes to at least 240 seconds to allow for weight streaming and
      graph optimization.
  3. Prefix Caching Alignment:
      β€’ Keep --enable-prefix-caching active.
      β€’ Organize prompts from static (system parameters, database schemas) to dynamic (user query, timestamped logs) to
      maximize prefix cache hit ratios.
Enter fullscreen mode Exit fullscreen mode

Finallyβ€Šβ€”β€Ša basic cost analsys:

  I have performed a cost-per-token analysis comparing the NVIDIA RTX 6000 Pro GPU (Cloud Run) to the Cloud TPU v6e-4
  based on GCP's resource pricing models.

  πŸ“‚ Detailed Cost Analysis Report: cost_per_token_analysis.md

  ### πŸ“Š Cost-per-Million Tokens Comparison (at Concurrency = 16)

  1. Short Context (8 Input Tokens):
      β€’ RTX 6000 Pro GPU: $3.49 per million tokens
      β€’ Cloud TPU v6e-4: $4.99 per million tokens
      β€’ Verdict: GPU is 30% cheaper because the machine-hour cost is lower ($3.76/hr vs $5.50/hr).
  2. Long Context (16,384 Input Tokens):
      β€’ RTX 6000 Pro GPU: $4.24 per million tokens
      β€’ Cloud TPU v6e-4: $0.53 per million tokens
      β€’ Verdict: TPU is 8x cheaper because the TPU v6e-4 leverages HBM2e memory bandwidth to maintain massive
      throughput (2,892 tokens/sec vs 246 tokens/sec on GPU).

  ### πŸ’‘ SRE Recommendation

  β€’ Spiky/Ad-Hoc Workloads (e.g. On-demand log diagnostics): Use the GPU on Cloud Run. It supports scale-to-zero,
  meaning you only pay the $3.76/hr rate while processing requests, saving substantial money over a TPU VM which must
  run constantly (charging $132/day even when idle).
  β€’ Continuous/High-Throughput Production Workloads: Use the Cloud TPU v6e-4. Running TPU instances continuously is
  significantly more cost-efficient for heavy text processing and long-context analysis.
Enter fullscreen mode Exit fullscreen mode

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with a incremental step by step approach.

A minimal stdio transport MCP Server was started from Python source code and validated with Antigravity CLI running as a MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot Cloud Run Model deployments.

Top comments (0)