DEV Community

xbill for Google Developer Experts

Posted on • Originally published at Medium on

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Gemma-SRE is a high-performance, private DevOps and Site Reliability Engineering (SRE) assistant designed for the TPU Sprint. It leverages Gemma 2 models, self-hosted via vLLM on Google Cloud Run GPUs (NVIDIA L4) or GKE TPUs (v5e), to bridge the gap between cloud infrastructure and intelligent automation.

What is this project trying to Do?

By integrating directly with Google Cloud Logging and the Vertex AI Model Registry, the SRE agent autonomously analyzes system errors, suggests production-ready remediation plans, and manages its own scalable inference stack. The project features a robust Model-as-Code approach, providing automated deployment templates that include optimized configurations for startup probes, GCS FUSE storage, and XLA-accelerated performance, ensuring a secure and cost-efficient private inference endpoint for any Google Cloud project.

Is this yet another Python MCP Demo?

Yes.

Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal viable vLLM MCP server that can be used with tools such as Gemini CLI or the Agent Development Kit (ADK).

So — What does this Project Actually Do?

This project is an automated SRE and DevOps assistant that uses a private, self-hosted AI (Gemma) to help you manage and troubleshoot Google Cloud infrastructure for deploying self-hosted models.

Specifically, it performs three main functions:

AI-Driven Troubleshooting

The agent connects to Google Cloud Logging to identify errors in your environment. It uses a self-hosted vLLM inference server to:

  • Summarize Logs: Turn raw, messy error logs into high-level summaries of what is actually failing.
  • Suggest Remediation: Provide expert 3-step fix plans for specific incidents (like memory limits being exceeded or container crashes).

Infrastructure Automation

The project includes a full "Inference Stack" manager. It can:

  • Deploy vLLM to Cloud Run: Automatically generate and execute commands to deploy an NVIDIA L4 GPU-accelerated server.
  • Manage Scaling: Adjust min/max instances and check the health of the inference service.
  • Generate Configs: Create Kubernetes (GKE) manifests for running vLLM on TPU v5e chips.

Model Weight Management

It bridges the gap between model providers and your private cloud:

  • Vertex AI & Kaggle Integration: Provides guides and commands to pull Gemma model weights from Vertex AI Model Garden or Kaggle.
  • GCS FUSE Setup: Handles the complex configuration needed to mount a Google Cloud Storage bucket as a local disk for vLLM, allowing you to run models without manually baking them into container images.

In short: It is an MCP (Model Context Protocol) server that gives an LLM (like Claude or Gemini) the "hands" to see your logs and the "tools" to deploy and manage its own TPU and GPU-powered inference infrastructure.
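To make the log-summarization idea concrete, here is a dependency-free sketch of the kind of pure function that sits behind such a tool. The function name and log shape are illustrative, not the project's actual code:

```python
from collections import Counter

def summarize_log_severities(entries: list) -> str:
    """Collapse raw log entries into a one-line severity summary.

    Each entry is expected to carry a 'severity' field, as Cloud
    Logging entries do; anything without one counts as UNKNOWN.
    """
    counts = Counter(e.get("severity", "UNKNOWN") for e in entries)
    return ", ".join(f"{sev}: {n}" for sev, n in sorted(counts.items()))

logs = [
    {"severity": "ERROR", "message": "container failed to start"},
    {"severity": "ERROR", "message": "OOMKilled"},
    {"severity": "INFO", "message": "revision deployed"},
]
print(summarize_log_severities(logs))  # ERROR: 2, INFO: 1
```

The real tool feeds a digest like this to the self-hosted Gemma model for interpretation rather than returning it directly.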

Python Version Management

One downside of Python's wide deployment is managing language versions across platforms and keeping a supported version installed.

The pyenv tool enables deploying consistent versions of Python:

GitHub - pyenv/pyenv: Simple Python version management

As of this writing, the mainstream Python version is 3.13. To validate your current Python:

xbill@penguin:~$ python --version
Python 3.13.12

xbill@penguin:~$ pyenv version
3.13.12 (set by /home/xbill/.pyenv/version)
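Alongside the shell checks above, the interpreter version can also be verified programmatically. This small sketch uses the article's 3.13 baseline; the helper itself is illustrative:

```python
import sys

def meets_minimum(required) -> bool:
    """True if the running interpreter is at least the required (major, minor)."""
    return sys.version_info[:2] >= tuple(required)

# Warn if the interpreter is older than the 3.13 baseline used here.
if not meets_minimum((3, 13)):
    print(f"Warning: expected Python 3.13+, found {sys.version.split()[0]}")
```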

Gemini CLI

If not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:

npm install -g @google/gemini-cli

Testing the Gemini CLI Environment

Once you have all the tools and the correct Node.js version in place, you can test the startup of Gemini CLI. You will need to authenticate with a key or your Google Account:

gemini

  ▝▜▄ Gemini CLI v0.35.0
    ▝▜▄
   ▗▟▀ Signed in with Google: xbill9@gmail.com /auth
  ▝▀ Plan: Gemini Code Assist in Google One AI Pro /upgrade

Node Version Management

Gemini CLI needs a consistent, up-to-date version of Node.js. The nvm command can be used to get a standard Node environment:

GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for vLLM management is an incremental, step-by-step approach.

First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Gemini CLI in the local environment.

This setup validates the connection from Gemini CLI to the local server via MCP. The MCP client (Gemini CLI) and the Python MCP server both run in the same local environment.
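Under the hood, the stdio transport carries JSON-RPC 2.0 messages, one JSON object per line on the server's stdin/stdout. As an illustration (the field values below are representative placeholders, not taken from this project), the client's opening initialize request can be built like this:

```python
import json

def build_initialize_request(request_id: int = 1) -> str:
    """Serialize the JSON-RPC 2.0 'initialize' request an MCP client sends first.

    The client name, version, and protocol revision are placeholders.
    """
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1"},
        },
    }
    return json.dumps(msg)

# One message per line on the server's stdin is all the transport needs.
line = build_initialize_request() + "\n"
```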

Setup the Basic Environment

At this point you should have a working Python environment and a working Gemini CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/aisprintapr2026

Then source init.sh from the cloned directory.

The script will attempt to determine your shell environment and set the correct variables:

cd aisprintapr2026
source init.sh

If your session times out or you need to re-authenticate, you can run the set_env.sh script to reset your environment variables:

cd aisprintapr2026
source set_env.sh

Variables like PROJECT_ID need to be set up for use in the various build scripts, so the set_env.sh script can be used to reset the environment if your session times out.

vLLM Management Tool with MCP Stdio Transport

One of the key features that the standard MCP libraries provide is abstracting various transport methods.

The high-level MCP tool implementation is the same regardless of the low-level transport the MCP client uses to connect to an MCP server.

The simplest transport that the SDK supports is stdio (stdin/stdout), which connects to a locally running process. Both the MCP client and MCP server must be running in the same environment.

The connection over stdio will look similar to this:

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")

# Serve over the stdio (stdin/stdout) transport
mcp.run(transport="stdio")

Running the Python Code

First, switch to the directory containing the Python version of the MCP sample code:

cd ~/aisprintapr2026/self-hosted-vllm-devops-agent

Install the release version on the local system:

xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ make install
Processing ./.


The project can also be linted:

xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
5 files already formatted
mypy .
Success: no issues found in 5 source files

And a test run:

xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ make test
python test_agent.py
📡 Automatically discovered vLLM at: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app
..........
----------------------------------------------------------------------
Ran 10 tests in 0.029s

OK

vLLM Interaction with MCP stdio Transport


In this project Gemini CLI is used as the MCP client to interact with the Python MCP server code.

Gemini CLI settings.json

Replace the default Gemini CLI configuration file, settings.json, with a pre-configured sample:

{
  "mcpServers": {
    "vllm-devops-agent": {
      "command": "python3",
      "args": ["/home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent/server.py"],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-east4",
        "VLLM_BASE_URL": "https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app",
        "MODEL_NAME": "/mnt/models/gemma-2b-it"
      }
    }
  }
}
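Before launching Gemini CLI, the file can be sanity-checked with a few lines of Python. The SAMPLE below is a trimmed copy of the configuration above; the checker itself is an illustrative sketch:

```python
import json

# Trimmed copy of the settings.json shown above.
SAMPLE = """
{
  "mcpServers": {
    "vllm-devops-agent": {
      "command": "python3",
      "args": ["/home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent/server.py"]
    }
  }
}
"""

def check_mcp_settings(text: str) -> list:
    """Return the names of MCP servers whose entries define a command."""
    servers = json.loads(text).get("mcpServers", {})
    return [name for name, cfg in servers.items() if "command" in cfg]

print(check_mcp_settings(SAMPLE))  # ['vllm-devops-agent']
```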

Validation with Gemini CLI

The final connection test uses Gemini CLI as an MCP client with the Python code providing the MCP server:


  ▝▜▄ Gemini CLI v0.35.0
    ▝▜▄
   ▗▟▀ Signed in with Google: xbill9@gmail.com /auth
  ▝▀ Plan: Gemini Code Assist in Google One AI Pro /upgrade

List the configured MCP servers to verify that the agent and its tools are registered:

 > /mcp list
Configured MCP servers:

🟢 vllm-devops-agent - Ready (15 tools, 1 resource)
  Tools:
  - mcp_vllm-devops-agent_analyze_cloud_logging
  - mcp_vllm-devops-agent_deploy_vllm
  - mcp_vllm-devops-agent_destroy_vllm
  - mcp_vllm-devops-agent_get_kaggle_model_copy_instructions
  - mcp_vllm-devops-agent_get_kagglehub_download_path
  - mcp_vllm-devops-agent_get_vertex_ai_model_copy_instructions
  - mcp_vllm-devops-agent_get_vllm_deployment_config
  - mcp_vllm-devops-agent_get_vllm_endpoint
  - mcp_vllm-devops-agent_get_vllm_tpu_deployment_config
  - mcp_vllm-devops-agent_list_bucket_models
  - mcp_vllm-devops-agent_list_vertex_models
  - mcp_vllm-devops-agent_query_vllm
  - mcp_vllm-devops-agent_status_vllm
  - mcp_vllm-devops-agent_suggest_sre_remediation
  - mcp_vllm-devops-agent_update_vllm_scaling
  Resources:
  - get_deployment_template (config://vllm-deployment-template) [text/plain]

vLLM Lifecycle Management via MCP

The MCP tools provide a complete suite of agent-oriented operations for managing vLLM deployments on Cloud Run GPUs or GKE TPUs.

Overview of Tools

## 🛠 Available Tools

The following tools are available via the MCP server:

### Infrastructure Management
* **`deploy_vllm`** : Deploys a new vLLM instance on Cloud Run GPU.
* **`destroy_vllm`** : Deletes the Cloud Run vLLM service.
* **`status_vllm`** : Checks the health and status of the vLLM service.
* **`update_vllm_scaling`** : Adjusts the minimum and maximum instances for scaling.
* **`get_vllm_deployment_config`** : Generates `gcloud` commands for deployment.

### Model Management
* **`list_vertex_models`** : Lists models in the Vertex AI Registry.
* **`list_bucket_models`** : Lists model artifacts in your GCS bucket.
* **`get_vertex_ai_model_copy_instructions`** : Guide to transfer Gemma models from Vertex AI.
* **`get_kaggle_model_copy_instructions`** : Guide to transfer Gemma models from Kaggle.

### AI & Operations
* **`analyze_cloud_logging`** : Summarizes error logs using self-hosted vLLM.
* **`suggest_sre_remediation`** : Provides 3-step plans for SRE incidents.
* **`query_vllm`** : Sends direct prompts to the self-hosted Gemma model.
* **`get_vllm_tpu_deployment_config`** : Generates GKE manifests for TPU v5e.
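One simple way a server can route calls to tools like these is a name-to-handler table. The sketch below is illustrative, not the project's actual dispatch code; the two handlers are stubs standing in for real gcloud lookups:

```python
def status_vllm(service_name: str) -> str:
    # Stub standing in for a real `gcloud run services describe` call.
    return f"status requested for {service_name}"

def get_vllm_endpoint(service_name: str) -> str:
    # Stub standing in for a Cloud Run URL lookup.
    return f"endpoint requested for {service_name}"

# Name -> handler table: one simple way a server dispatches tool calls.
TOOLS = {
    "status_vllm": status_vllm,
    "get_vllm_endpoint": get_vllm_endpoint,
}

def call_tool(name: str, **kwargs) -> str:
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

print(call_tool("status_vllm", service_name="vllm-gemma-2b-it"))
# status requested for vllm-gemma-2b-it
```

In practice the FastMCP `@mcp.tool()` decorator builds an equivalent registry for you.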

mcp_vllm-devops-agent_deploy_vllm - deploy vLLM

This tool handles the deployment of the vLLM service:

✦ I will deploy the vLLM service using the default configuration (vllm-gemma-2b-it model from the aisprint-491218-bucket).
✦ I will check the status of the Cloud Run service using make status to determine if the deployment is still in progress or if it failed.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell make status [current working directory /home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent] (Check if the deployment i… │
│ │
│ 🔍 Checking status of Cloud Run service vllm-gemma-2b-it in us-east4... │
│ gcloud run services describe vllm-gemma-2b-it \ │
│ --project=aisprint-491218 \ │
│ --region=us-east4 │
│ ERROR: (gcloud.run.services.describe) Cannot find service [vllm-gemma-2b-it] │
│ make: *** [Makefile:63: status] Error 1 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ I will list the available GCS buckets in the project to verify that the target bucket exists.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell gcloud storage buckets list --project=aisprint-491218 [current working directory /home/xbill/aisprintapr2026/self-hosted-vll… │
│ │
│ ... first 37 lines hidden (Ctrl+O to show) ... │
│ location_type: region │
│ metageneration: 1 │
│ name: aisprint-491218-bucket │
│ public_access_prevention: inherited │
│ satisfies_pzs: true │
│ soft_delete_policy: │
│ effectiveTime: '2026-03-25T00:31:13.502000+00:00' │
│ retentionDurationSeconds: '604800' │
│ storage_url: gs://aisprint-491218-bucket/ │
│ uniform_bucket_level_access: false │
│ update_time: 2026-03-25T00:31:13+0000 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ I will list the contents of the gemma-2b-it folder in the GCS bucket to ensure the model artifacts are present.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell gcloud storage ls gs://aisprint-491218-bucket/gemma-2b-it/ [current working directory /home/xbill/aisprintapr2026/self-hoste… │
│ │
│ gs://aisprint-491218-bucket/gemma-2b-it/config.json │
│ gs://aisprint-491218-bucket/gemma-2b-it/gemma-2b-it.gguf │
│ gs://aisprint-491218-bucket/gemma-2b-it/generation_config.json │
│ gs://aisprint-491218-bucket/gemma-2b-it/model-00001-of-00002.safetensors │
│ gs://aisprint-491218-bucket/gemma-2b-it/model-00002-of-00002.safetensors │
│ gs://aisprint-491218-bucket/gemma-2b-it/model.safetensors.index.json │
│ gs://aisprint-491218-bucket/gemma-2b-it/special_tokens_map.json │
│ gs://aisprint-491218-bucket/gemma-2b-it/tokenizer.json │
│ gs://aisprint-491218-bucket/gemma-2b-it/tokenizer.model │
│ gs://aisprint-491218-bucket/gemma-2b-it/tokenizer_config.json │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ I will now run the deployment command manually using make deploy to observe the output and any potential error messages.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell make deploy [current working directory /home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent] (Execute the deployment di… │
│ │
│ ... first 18 lines hidden (Ctrl+O to show) ... │
│ --add-volume-mount=volume=model-volume,mount-path=/mnt/models \ │
│ --args=--model=/mnt/models/gemma-2b-it,--max-model-len=4096,--trust-remote-code,--gpu-memory-utilization=0.9,--host=0.0.0.0 \ │
│ --no-allow-unauthenticated \ │
│ --region=us-east4 │
│ Deploying container to Cloud Run service [vllm-gemma-2b-it] in project [aisprint-491218] region [us-east4] │
│ ✓ Deploying new service... Done. │
│ ✓ Creating Revision... │
│ ✓ Routing traffic... │
│ Done. │
│ Service [vllm-gemma-2b-it] revision [vllm-gemma-2b-it-00001-zqc] has been deployed and is serving 100 percent of traffic. │
│ Service URL: https://vllm-gemma-2b-it-289270257791.us-east4.run.app │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ I will now check the status of the newly deployed Cloud Run service using make status.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell make status [current working directory /home/xbill/aisprintapr2026/self-hosted-vllm-devops-agent] (Verify the status of the … │
│ │
│ ... first 35 lines hidden (Ctrl+O to show) ... │
│ Service account: 289270257791-compute@developer.gserviceaccount.com │
│ Concurrency: 4 │
│ Max instances: 1 │
│ Timeout: 3600s │
│ CPU Allocation: CPU is always allocated │
│ Execution Environment: Second Generation │
│ Volumes: │
│ model-volume │
│ type: cloud-storage │
│ bucket: aisprint-491218-bucket │
│ read-only: True │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ The vLLM service vllm-gemma-2b-it has been successfully deployed and is now active at:
  URL: https://vllm-gemma-2b-it-289270257791.us-east4.run.app

  Summary of deployment:
   - Model: gemma-2b-it (loaded from gs://aisprint-491218-bucket/gemma-2b-it)
   - GPU: 1x NVIDIA L4
   - Resources: 4 CPU, 16Gi Memory
   - Status: Active and serving 100% of traffic.

  You can now use the status_vllm tool (or make status) at any time to check its health and configuration.


mcp_vllm-devops-agent_destroy_vllm

This tool handles cleanup and deallocation of the vLLM service.

mcp_vllm-devops-agent_get_kaggle_model_copy_instructions

mcp_vllm-devops-agent_get_kagglehub_download_path

mcp_vllm-devops-agent_get_vertex_ai_model_copy_instructions

mcp_vllm-devops-agent_list_bucket_models

mcp_vllm-devops-agent_list_vertex_models

These tools provide instructions for downloading model files and listing the available models.

mcp_vllm-devops-agent_get_vllm_deployment_config

This tool generates the deployment parameters.

 > mcp_vllm-devops-agent_get_vllm_deployment_config
✦ I will generate the deployment configuration for vLLM on Cloud Run using the default settings.

✦ The deployment command for the vllm-gemma-2b-it service on Cloud Run is:

gcloud beta run deploy vllm-gemma-2b-it \
  --image=vllm/vllm-openai:latest \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-gpu-zonal-redundancy \
  --no-cpu-throttling \
  --concurrency=4 \
  --timeout=3600 \
  --startup-probe=timeoutSeconds=60,periodSeconds=60,failureThreshold=10,initialDelaySeconds=180,httpGet.port=8000,httpGet.path=/health \
  --max-instances=1 \
  --min-instances=0 \
  --port=8000 \
  --memory=16Gi \
  --cpu=4 \
  --execution-environment=gen2 \
  --add-volume=name=model-volume,type=cloud-storage,bucket=aisprint-bucket,readonly=true \
  --add-volume-mount=volume=model-volume,mount-path=/mnt/models \
  --args=--model=/mnt/models/gemma-2b-it,--max-model-len=4096,--trust-remote-code,--gpu-memory-utilization=0.9,--host=0.0.0.0 \
  --no-allow-unauthenticated \
  --region=us-east4


mcp_vllm-devops-agent_get_vllm_endpoint

This tool gets the endpoint for the vLLM deployment.

 > get_vllm_endpoint
✦ I will retrieve the current active vLLM endpoint URL for the vllm-gemma-2b-it service.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ get_vllm_endpoint (vllm-devops-agent MCP Server) {"service_name":"vllm-gemma-2b-it"} │
│ │
│ https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ The current active vLLM endpoint URL is:
  https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app

mcp_vllm-devops-agent_status_vllm

This tool checks the status of the vLLM service.

  > agent_status_vllm
✦ I will check the status of the vllm-gemma-2b-it Cloud Run service.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ status_vllm (vllm-devops-agent MCP Server) {"service_name":"vllm-gemma-2b-it"} │
│ │
│ Status for vllm-gemma-2b-it: │
│ status: │
│ conditions: │
│ - lastTransitionTime: '2026-03-25T22:31:53.860782Z' │
│ status: 'True' │
│ type: Ready │
│ - lastTransitionTime: '2026-03-25T19:00:51.414639Z' │
│ status: 'True' │
│ type: ConfigurationsReady │
│ - lastTransitionTime: '2026-03-25T22:31:53.832098Z' │
│ status: 'True' │
│ type: RoutesReady │
│ latestCreatedRevisionName: vllm-gemma-2b-it-00001-zqc │
│ url: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app │
│ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ The vllm-gemma-2b-it service is Ready.

   - URL: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app
   - Latest Revision: vllm-gemma-2b-it-00001-zqc
   - Status: All conditions (Ready, ConfigurationsReady, RoutesReady) are True.

mcp_vllm-devops-agent_query_vllm

This tool formats and queries the deployed LLM.

 > mcp_vllm-devops-agent_query_vllm why is the sky blue?
✦ I will query the vLLM model to explain why the sky is blue.
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ query_vllm (vllm-devops-agent MCP Server) {"prompt":"why is the sky blue?"} │
│ │
│ vLLM Response │
│ │
│ The sky appears blue due to Rayleigh scattering. Rayleigh scattering is the scattering of light waves by particles of a similar size │
│ to the wavelength of light. This means that blue light, with a shorter wavelength, is scattered more strongly than longer wavelengths. │
│ │
│ The blue light scattered by the air molecules is scattered in all directions, but it is scattered more strongly in the direction of │
│ the sun. This is because the sun's light has a longer path to travel through the air than the longer wavelengths of light. │
│ │
│ As a result, the sky appears blue. │
│ │
│ Rayleigh scattering is a very important phenomenon in the atmosphere. It is responsible for the blue color of the sky, as well as for │
│ the blue colors of many other objects, such as the ocean and the eyes of birds. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ The vLLM model explains that the sky appears blue due to Rayleigh scattering, where air molecules scatter shorter wavelengths of light (like
  blue) more strongly in all directions than longer wavelengths.
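Under the hood, query_vllm talks to vLLM's OpenAI-compatible API. As a sketch of the request side (the /v1/chat/completions route and body shape are standard vLLM behavior; max_tokens here is an assumed default), the body can be built like this:

```python
import json

def build_chat_request(prompt: str, model: str = "/mnt/models/gemma-2b-it") -> dict:
    """Request body for vLLM's OpenAI-compatible /v1/chat/completions route.

    The model path matches the MODEL_NAME environment variable used by
    this project; max_tokens is an illustrative default.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

# The JSON body a client would POST (with an identity token, since the
# Cloud Run service is deployed with --no-allow-unauthenticated):
body = json.dumps(build_chat_request("why is the sky blue?"))
```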

mcp_vllm-devops-agent_analyze_cloud_logging

This tool provides visibility into the model logs.

✦ I have completed the log analysis for your self-hosted vLLM deployment.

  📋 Cloud Logging Analysis Summary

  Recent logs revealed three primary categories of events:

   1. Resolved Container Startup Failures (Port Misconfiguration):
      Earlier logs from 2026-03-25T02:38Z show that the vllm-gemma-2b-it service failed to start because it was trying to listen on the
  default Cloud Run port (8080), while the vLLM container is configured to use port 8000. This caused the default startup TCP probe to fail
  with DEADLINE_EXCEEDED.

   2. Idempotency Errors (Bucket Creation):
      Several ERROR logs from 16:28Z and 18:25Z indicate that automated scripts attempted to create the GCS bucket aisprint-491218-bucket when
  it already existed. These errors are non-critical and purely informational regarding resource idempotency.

   3. Current Deployment Status (Healthy):
      The service was successfully redeployed at 22:31Z (Revision: vllm-gemma-2b-it-00001-zqc). It is now correctly configured to use:
       * Port: 8000
       * Health Check: /health on port 8000
       * Scaling: min-instances: 1 (Warm GPU instance)

  🚀 Verification
  I have verified that the endpoint is functional:
   - Status: Ready
   - URL: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app
   - Test Query: Successfully responded with "The capital of France is Paris."

  The system is currently stable and performing as expected.

mcp_vllm-devops-agent_suggest_sre_remediation

This tool reviews the logs and provides suggestions on fixing deployment issues.

✦ I have successfully used the suggest_sre_remediation tool to analyze the container startup failure (PORT=8080 timeout error).

  🩺 SRE Remediation Plan for Container Startup Failure

  1. Root Cause Identification
   * Log Correlation: Check vllm-gemma-2b-it logs for a mismatch between the container's listening port and the Cloud Run default port.
   * Port Check: Confirm if vLLM is using its default port (8000) while the Cloud Run environment expects port 8080.
   * Timeout Check: Verify if the startup-probe timeout is sufficient for the vLLM model (Gemma-2b-it) to load into GPU memory.

  2. Corrective Actions
   * Port Alignment: Update the Cloud Run service configuration to listen on port 8000 or explicitly set the --port argument in the vLLM
     command.
   * Timeout Extension: Increase the initialDelaySeconds and timeoutSeconds for the startupProbe to account for large model weight loading
     from GCS FUSE.
   * Scaling Check: Ensure that gpu-memory-utilization is not set too high (default 0.9), as this can cause OOM errors during initialization.

  3. Verification & Validation
   * Redeploy: Apply the corrected configuration using gcloud run services update.
   * Health Probe: Verify the /health endpoint returns 200 OK on port 8000.
   * Uptime Monitoring: Confirm the service stays Ready after the initial 180s delay.

  The service is currently healthy after the latest redeployment with the corrected port settings.

mcp_vllm-devops-agent_update_vllm_scaling

This tool allows model scaling.

 > mcp_vllm-devops-agent_update_vllm_scaling
✦ I can update the scaling configuration for your vLLM service. Currently, it is set to min-instances: 1 and max-instances: 1.

  What would you like to set the new limits to?

   * Min Instances: (e.g., 0 to save costs when idle, or 1 to avoid cold starts)
   * Max Instances: (e.g., 1 to stay within GPU quota, or higher if available)
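Behind a scaling update like this is a single gcloud command. The sketch below builds that command string; --min-instances and --max-instances are real gcloud run flags, while the helper itself is illustrative:

```python
def scaling_update_command(service: str, min_instances: int, max_instances: int,
                           region: str = "us-east4") -> str:
    """Build the gcloud command behind a scaling update (helper is illustrative)."""
    if not 0 <= min_instances <= max_instances:
        raise ValueError("need 0 <= min_instances <= max_instances")
    return (
        f"gcloud run services update {service} "
        f"--min-instances={min_instances} --max-instances={max_instances} "
        f"--region={region}"
    )

print(scaling_update_command("vllm-gemma-2b-it", 0, 1))
# gcloud run services update vllm-gemma-2b-it --min-instances=0 --max-instances=1 --region=us-east4
```

Setting min-instances to 0 saves GPU cost when idle at the price of cold starts, as the tool's prompt points out.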

mcp_vllm-devops-agent_get_vllm_tpu_deployment_config

This MCP tool generates the deployment settings required to migrate vLLM from Cloud Run GPUs to dedicated TPUs on GKE:

✦ 🌀 vLLM on TPU v5e (GKE Deployment)

  To deploy vLLM on TPUs, use the following GKE manifest. This configuration targets a TPU v5e-8 (8 chips), which is ideal for Gemma 2 9B or
  27B.

  1. Create a TPU Node Pool (if not exists)

gcloud container node-pools create tpu-v5e-8 \
  --cluster=tpu-cluster \
  --location=us-east4 \
  --machine-type=ct5lp-hightpu-4t \
  --tpu-topology=2x4 \
  --num-nodes=1

  2. Kubernetes Manifest (vllm-tpu.yaml)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-tpu
  template:
    metadata:
      labels:
        app: vllm-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        cloud.google.com/gke-tpu-topology: 2x4
      containers:
      - name: vllm-tpu
        image: vllm/vllm-tpu:latest
        resources:
          limits:
            google.com/tpu: "8"
          requests:
            google.com/tpu: "8"
        env:
        - name: VLLM_XLA_CACHE_PATH
          value: "/tmp/vllm_xla_cache"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=google/gemma-2-9b-it"
        - "--tensor-parallel-size=8"
        - "--max-model-len=8192"
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-tpu-service
spec:
  selector:
    app: vllm-tpu
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: ClusterIP
  3. Deployment Steps
   1. Save the YAML above to vllm-tpu.yaml.
   2. Apply it: kubectl apply -f vllm-tpu.yaml.
   3. (Optional) If using a private model, ensure a Hugging Face token is provided via secret.
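A quick sanity check on the manifest: the chip count requested via google.com/tpu: "8" should match the product of the topology dimensions (2x4). A small helper makes that relationship explicit; the function is illustrative:

```python
from math import prod

def chips_in_topology(topology: str) -> int:
    """Number of TPU chips implied by a topology string like '2x4'."""
    return prod(int(dim) for dim in topology.split("x"))

# The manifest's 2x4 slice implies 8 chips, matching google.com/tpu: "8".
print(chips_in_topology("2x4"))  # 8
```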

Running the Tools Stand-Alone

A sample Python script is also provided to run the tools directly:

xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$ python demo_launcher.py 
2026-03-25 20:41:57,541 - vllm-devops-agent - INFO - Initializing DevOps Agent MCP Server...
🚀 TPU Sprint Demo: Self-Hosted vLLM DevOps Agent
============================================================

[Step 1] Analyzing Cloud Logging errors (severity=ERROR)...
2026-03-25 20:42:01,520 - vllm-devops-agent - INFO - Attempting to discover vLLM URL for service: vllm-gemma-2b-it
2026-03-25 20:42:02,432 - vllm-devops-agent - INFO - 📡 Automatically discovered vLLM at: https://vllm-gemma-2b-it-wgcq55zbfq-uk.a.run.app
  ANALYSIS: ### Log Analysis (Self-Hosted vLLM)

- The first log indicates a failed startup of a container due to a missing port for the container.
- The second log indicates a failed startup of a container due...

[Step 2] Proposing remediation for 'MemoryLimitExceeded'...
  REMEDIATION: ### Remediation Plan

**Step 1: Investigate the OOMKilled error**

* Review the logs for the pod that crashed.
* Check the resource usage of the pod, including memory, CPU, and network.
* Use tools like `kubectl top pods` or `kubectl top pods --all-namespaces` to monitor resource usage.
* Identify any bottlenecks or resource requests that may have caused the OOMKilled error.

**Step 2: Identify the root cause of the OOMKilled error**

* Analyze the logs and resource usage data to identify the specific cause of the OOMKilled error.
* This could be due to a memory leak, a resource starvation issue, or a combination of factors.

**Step 3: Implement a solution to address the root cause**

* Once you have identified the root cause, implement a solution to address it.
* This could involve increasing the available memory for the pod, reducing the resource requests, or optimizing the application code.

**Additional Considerations:**

* It is important to take a proactive approach to prevent OOMKilled errors.
* Consider using resource monitoring tools to continuously monitor resource usage and identify potential bottlenecks.
* Implement a monitoring system to alert on OOMKilled events and take corrective actions.
* Review the application code and ensure that it is efficient and does not allocate more resources than necessary.

[Step 3] Vertex AI Model Garden Instructions...

### 🚀 Transferring gemma-2b-it from Vertex AI Model Garden

To use vLLM with Cloud Storage FUSE without Hugging Face, follow these steps:

1. **Accept Terms:** Go to the Vertex AI Model Garden page for Gemma (https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335) and click 'Accept' on the license agreement.
2. **Download via Signed URL:** After accepting, the console provides a 'Download' button or a signed URL.
3. **Transfer to GCS:**
   If you have the artifacts locally after downloading from the console, use:
   `gcloud storage cp -r ./model_artifacts/* gs://aisprint-491218-bucket/gemma-2b-it/`

4. **Alternative (Direct GCS Copy):**
   Google occasionally provides a managed GCS path for verified projects. If accessible, you can use:
   `gcloud storage cp -r gs://vertex-ai-models/gemma/gemma-2b-it/* gs://aisprint-491218-bucket/gemma-2b-it/`

Once the artifacts are in your bucket, use `get_vllm_deployment_config` to generate your Cloud Run deployment command.

[Step 4] Generating Cloud Run GPU Deployment Config (with GCS FUSE)...
  COMMAND: gcloud beta run deploy vllm-sre-agent --image=vllm/vllm-openai:latest --gpu=1 --gpu-type=nvidia-l4 --no-gpu-zonal-redundancy --no-cpu-throttling --concurrency=4 --timeout=3600 --startup-probe=timeoutSeconds=60,periodSeconds=60,failureThreshold=10,initialDelaySeconds=180,httpGet.port=8000,httpGet.path=/health --max-instances=1 --min-instances=0 --port=8000 --memory=16Gi --cpu=4 --execution-environment=gen2 --add-volume=name=model-volume,type=cloud-storage,bucket=my-gemma-bucket,readonly=true --add-volume-mount=volume=model-volume,mount-path=/mnt/models --args=--model=/mnt/models/gemma-2b-it,--max-model-len=4096,--trust-remote-code,--gpu-memory-utilization=0.9,--host=0.0.0.0 --no-allow-unauthenticated --region=us-east4

[Step 5] Listing available Vertex AI Models (using ADK/SDK)...
  No models found in Vertex AI Model Registry....

[Step 6] Reading MCP Resource (vLLM Deployment Template)...
  TEMPLATE (first 100 chars): # Cloud Run vLLM Deployment Template
# Required: Second Generation execution environment
# Required:...

============================================================
 DevOps Agent Demo Complete: Self-hosted SRE intelligence ready!
xbill@penguin:~/aisprintapr2026/self-hosted-vllm-devops-agent$
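The long `gcloud` command in Step 4 is much easier to maintain when assembled programmatically. Here is a minimal sketch of how a tool like `get_vllm_deployment_config` might build it — the flag values mirror the Step 4 output above, but the function itself and its defaults are illustrative, not the project's actual implementation:

```python
# Illustrative builder for the Cloud Run vLLM deploy command.
# Flag values mirror the demo's Step 4 output; the helper is a sketch.
from typing import List


def build_cloud_run_command(
    service: str = "vllm-sre-agent",
    bucket: str = "my-gemma-bucket",
    model: str = "gemma-2b-it",
    region: str = "us-east4",
) -> str:
    """Assemble a `gcloud beta run deploy` command for a vLLM GPU service."""
    mount = "/mnt/models"
    flags: List[str] = [
        f"gcloud beta run deploy {service}",
        "--image=vllm/vllm-openai:latest",
        "--gpu=1",
        "--gpu-type=nvidia-l4",
        "--memory=16Gi",
        "--cpu=4",
        "--execution-environment=gen2",
        # Mount the model weights from GCS via Cloud Storage FUSE.
        f"--add-volume=name=model-volume,type=cloud-storage,bucket={bucket},readonly=true",
        f"--add-volume-mount=volume=model-volume,mount-path={mount}",
        # vLLM server arguments, pointed at the mounted weights.
        f"--args=--model={mount}/{model},--max-model-len=4096,--gpu-memory-utilization=0.9,--host=0.0.0.0",
        "--no-allow-unauthenticated",
        f"--region={region}",
    ]
    return " ".join(flags)
```

Generating the command from parameters keeps the GPU, memory, and FUSE settings in one place, so switching buckets or regions is a one-argument change instead of a hand-edited shell one-liner.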

One Last Review with Gemini CLI

 > summarize the project
✦ This project is an MCP (Model Context Protocol) Server designed as an automated DevOps/SRE assistant. It leverages self-hosted Gemma models
  running on vLLM (Cloud Run GPU) to analyze Google Cloud infrastructure logs and provide actionable remediation strategies.

  Core Capabilities
   * Log Intelligence: Automatically fetches and summarizes Cloud Logging errors (severity=ERROR) to identify root causes using private LLM
     inference.
   * Incident Response: Generates expert 3-step remediation plans for specific SRE incidents (e.g., OOMKilled, connectivity issues).
   * Infrastructure Lifecycle: Provides tools to deploy, scale, and destroy vLLM inference stacks on Cloud Run using NVIDIA L4 GPUs and GCS
     FUSE.
   * Model Portability: Includes automated workflows to transfer Gemma weights from Vertex AI Model Garden or Kaggle to private GCS buckets
     for use with vLLM.

  Technical Architecture
   * Inference Stack: vLLM running on Cloud Run (gen2) with 16GiB RAM and 4 vCPUs. It uses GCS FUSE to mount model weights directly from Cloud
     Storage.
   * Agent Logic: Built with FastMCP, integrating google-cloud-logging for telemetry and the Vertex AI SDK (ADK) for model registry
     management.
   * Automation: A comprehensive Makefile handles environment setup, GPU deployment, and interactive model querying.

  The system is optimized for Private AI Operations, ensuring that infrastructure logs are analyzed by a private model endpoint rather than
  being sent to a public API.
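For reference, wiring a local stdio MCP server into Gemini CLI is typically a short entry under `mcpServers` in `.gemini/settings.json`. The server name and `server.py` path below are placeholders, not the project's actual filenames:

```json
{
  "mcpServers": {
    "vllm-sre-agent": {
      "command": "python",
      "args": ["server.py"]
    }
  }
}
```

With this in place, Gemini CLI launches the Python process itself and talks to it over stdin/stdout, which is what makes the fully local validation loop above possible.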

Summary

The strategy for using MCP for vLLM DevOps across Cloud Run and TPU with Gemini CLI was validated with an incremental, step-by-step approach.

A minimal stdio transport MCP server was started from Python source code and validated with Gemini CLI running as an MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot vLLM deployments.
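To make the stdio mechanics concrete, here is a toy sketch of the transport loop such a server sits on, using only the standard library. This is not the project's FastMCP code — FastMCP handles the protocol for you — and the tool list here is illustrative; it just shows the shape of a line-delimited JSON-RPC exchange over stdin/stdout:

```python
# Toy sketch of a stdio JSON-RPC loop like the one underlying MCP's
# stdio transport. Not the project's FastMCP server; illustrative only.
import json
import sys

TOOLS = [{"name": "summarize_logs", "description": "Summarize recent ERROR logs"}]


def handle(request: dict) -> dict:
    """Route one JSON-RPC request to a canned MCP-style response."""
    if request.get("method") == "tools/list":
        result = {"tools": TOOLS}
    else:
        result = {"error": f"unknown method: {request.get('method')}"}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


def serve(stream_in=sys.stdin, stream_out=sys.stdout):
    """One JSON-RPC message per line over stdio, as in MCP's stdio transport."""
    for line in stream_in:
        response = handle(json.loads(line))
        stream_out.write(json.dumps(response) + "\n")
        stream_out.flush()
```

The appeal of this transport is exactly what the demo shows: no ports, no TLS, no auth layer — the client (Gemini CLI) owns the subprocess, so the whole exchange stays on the local machine.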

This approach can be extended to more complex deployments using other MCP transports and cloud-based options.

